Collaborators: Adiah Amadou, Ryan Kemajou
A: Project idea - 5%
B: Dataset Curation and Preprocessing - 10%
C: Data Exploration and Summary Statistics - 10%
D: ML Algorithm Design/Development - 25%
E: ML Algorithm Training and Test Data Analysis - 20%
F: Visualization, Result Analysis, Conclusion - 15%
G: Final Tutorial Report Creation - 10%
H: Additional (not listed above, if any) - 5%
Member 1: Adiah Amadou, Contribution: 100%.
Member 2: Ryan Kemajou, Contribution: 100%.
Adiah Amadou: My contributions spanned several key areas of the project. I collaborated on crafting the introduction, ensuring clarity and relevance to set the stage for our work. I then carried out data cleaning and exploratory analysis, applying techniques such as linear regression and mean/median/mode calculations to derive insights. I handled the formatting, layout, and typing of the final project so that its presentation was professional and cohesive. I also played a central role in refining the machine learning model, troubleshooting bugs and incorporating features to improve accuracy. Before implementation, I participated in researching the project idea and curated sources to underpin our work. Beyond the technical tasks, I contributed conclusions and insights at each stage and helped keep our workflow productive by raising questions aimed at improving outcomes.
Ryan Kemajou: In this data science project, I assisted at every stage of the project lifecycle. I contributed to identifying the project theme, ensuring it was relevant and impactful, and conducted extensive research to locate the datasets that serve as the backbone of our analysis. I cleaned a portion of the data before integrating it into our shared OneDrive, and once the datasets were distributed among the team, I further cleaned my assigned portion to prepare it for analysis. After data cleaning, I proposed questions for analysis and suggested effective ways to visualize our findings. I joined brainstorming sessions to determine the most suitable statistical tests, assisted in exploring the data to uncover patterns and insights, and participated in discussions on selecting the most appropriate model, providing supplementary materials such as articles and YouTube videos to deepen the team's understanding of the task, specifically the use of random forest classification. Lastly, I helped ensure our final tutorial was well documented with clear, concise comments, enhancing its readability.
In the competitive world of professional basketball, the ability to predict future outcomes isn't just about entertainment; it's about gaining a strategic advantage. Imagine coaches being able to forecast which NBA team(s) will dominate the next season. With such insights, coaches can fine-tune their training regimes, refine game strategies, and make informed decisions about player acquisitions. Players themselves can set more targeted performance goals, aiming for the stats that truly matter. Fans and bettors, too, can benefit greatly from such predictions, knowing which teams to support and why.
Harnessing the power of advanced analytics and historical data, we aim to develop predictive models that not only forecast the outcomes of the NBA season but also empower coaches, players, fans, and bettors alike. Through comprehensive analysis of diverse datasets encompassing team and player statistics, awards, opponent performance, and injury reports, we seek to uncover hidden patterns and insights that illuminate the path to success in the NBA.
For this project, the chosen datasets cover NBA data from 1947 to the present. They were obtained from Kaggle and Basketball-Reference.com, reliable sources for basketball data.
Our datasets contain more than 30,000 rows and over 30 columns, with statistics encompassing the following:
Player Season Stats: Individual player statistics including points per game, rebounds, assists, steals, blocks, shooting percentages, and other relevant metrics.
Awards (Players) Stats: Historical data on awards won by players, such as MVP awards, Defensive Player of the Year, Rookie of the Year, and other accolades.
Opponent Team Stats: Statistics related to opponents faced by each team, including their performance metrics, strength of schedule, and head-to-head matchups.
Player Injuries: Information on player injuries and their impact on team performance throughout the season.
By assembling these datasets from trusted sources, we ensure the reliability and accuracy of the data used in our predictive models. This comprehensive approach allows us to analyze a wide range of factors that contribute to NBA team performance, enabling us to develop robust and insightful predictions for the upcoming season.
We decided to keep all of the files (mainly CSVs) relevant to this project in a Google Drive, so we mount it here.
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
It's essential to bring in certain libraries so below are the libraries that will accompany us on our journey through this project.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
import scipy.stats as stats
import statsmodels.api as sm
import math
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import LabelEncoder

We opted for a step-by-step approach in handling our dataframes. Below, we begin by reading in the CSV files of player statistics for each NBA team, then consolidate them into a single dataframe named "player_df."
df_totals = pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/Player Totals.csv')
df_shooting = pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/Player Shooting.csv')
df_play_by_play = pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/Player Play By Play.csv')
df_per_game = pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/Player Per Game.csv')
df_per_100_poss = pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/Per 100 Poss.csv')
df_per_36_minutes = pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/Per 36 Minutes.csv')
m1 = pd.merge(df_totals, df_shooting, on =['SEASON_ID', 'SEASON', 'PLAYER_ID', 'PLAYER', 'POSITION', 'AGE', 'LEAGUE', 'TEAM_ABBREVIATION', 'MINUTES_PLAYED'], how ='left')
m2 = pd.merge(m1, df_play_by_play, on =['SEASON_ID', 'SEASON', 'PLAYER', 'POSITION', 'AGE', 'EXPERIENCE', 'LEAGUE','GAMES', 'MINUTES_PLAYED'], how ='left')
m3 = pd.merge(m2, df_per_game, on =['SEASON_ID', 'SEASON', 'PLAYER_ID', 'PLAYER', 'POSITION', 'AGE', 'EXPERIENCE', 'LEAGUE', 'TEAM_ABBREVIATION', 'GAMES', 'FREE_THROW_PERCENTAGE', 'FIELD_GOAL_PERCENT'], how ='left')
m4 = pd.merge(m3, df_per_100_poss, on =['SEASON_ID', 'SEASON', 'PLAYER_ID', 'PLAYER', 'AGE', 'LEAGUE', 'TEAM_ABBREVIATION', 'GAMES', 'GAMES_STARTED', 'MINUTES_PLAYED', 'THREE_POINTERS_PERCENTAGE', 'TWO_POINTERS_PERCENTAGE', 'FREE_THROW_PERCENTAGE', 'YEARS_OF_EXPERIENCE'], how ='left')
player_df = pd.merge(m4, df_per_36_minutes, on =['SEASON_ID','POSITION', 'AGE', 'LEAGUE', 'GAMES', 'GAMES_STARTED', 'MINUTES_PLAYED', 'THREE_POINTERS_PERCENTAGE', 'TWO_POINTERS_PERCENTAGE', 'FREE_THROW_PERCENTAGE', 'YEARS_OF_EXPERIENCE', 'TEAM'], how ='left')
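One side effect of chaining merges is worth noting: when two dataframes share a column that is not listed in `on`, pandas keeps both copies and appends `_x`/`_y` suffixes, which is where columns like `FIELD_GOAL_PERCENTAGE_x` in our merged frame come from. A minimal sketch on toy data (the column names here are made up, not from our CSVs):

```python
import pandas as pd

left = pd.DataFrame({'PLAYER': ['A', 'B'], 'FG_PCT': [0.45, 0.50]})
right = pd.DataFrame({'PLAYER': ['A', 'B'], 'FG_PCT': [0.46, 0.50]})

# FG_PCT is not a join key, so both copies survive with default suffixes
merged = left.merge(right, on='PLAYER', how='left')
print(merged.columns.tolist())  # ['PLAYER', 'FG_PCT_x', 'FG_PCT_y']
```

Listing such columns in `on` (or renaming before merging) avoids the suffixed duplicates.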
player_df.describe()
| SEASON_ID | SEASON | PLAYER_ID | BIRTH_YEAR | AGE | EXPERIENCE | GAMES | GAMES_STARTED | MINUTES_PLAYED | FIELD_GOALS_MADE | ... | FREETHROW_ATTEMPTED_PER_36_MIN | OFFENSIVE_REBOUND_PER_36_MIN | DEFENSIVE_REBOUND_PER_36_MIN | TOTAL_REBOUND_PER_36_MIN | ASSIST_PER_36_MIN | STEALS_PER_36_MIN | BLOCKS_PER_36_MIN | TURNOVER_PER_36_MIN | PERSONAL_FOULS_PER_36_MIN | POINTS_PER_36_MIN | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 31787.00000 | 31787.000000 | 31787.000000 | 2870.000000 | 31765.000000 | 31787.000000 | 31787.000000 | 23150.000000 | 30704.000000 | 31787.000000 | ... | 16697.000000 | 16697.000000 | 16697.000000 | 16697.000000 | 16697.000000 | 16697.000000 | 16697.000000 | 16697.000000 | 16697.000000 | 16697.000000 |
| mean | 15894.00000 | 1994.973008 | 2646.740114 | 1944.139721 | 26.486164 | 4.967691 | 48.607796 | 22.406177 | 1158.959386 | 187.590965 | ... | 3.182793 | 1.779745 | 4.587794 | 6.367036 | 2.965131 | 1.124837 | 0.730017 | 2.064868 | 3.598341 | 13.326556 |
| std | 9176.26084 | 20.403059 | 1301.138515 | 15.162153 | 3.844564 | 3.723561 | 26.616961 | 27.814887 | 926.420014 | 184.565391 | ... | 2.184569 | 1.633455 | 2.274799 | 3.304136 | 2.144295 | 0.776187 | 0.872100 | 1.305920 | 1.894635 | 5.179310 |
| min | 1.00000 | 1947.000000 | 1.000000 | 1914.000000 | 18.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 7947.50000 | 1979.000000 | 1625.000000 | 1938.000000 | 24.000000 | 2.000000 | 25.000000 | 0.000000 | 308.000000 | 37.000000 | ... | 1.800000 | 0.700000 | 3.100000 | 3.900000 | 1.500000 | 0.700000 | 0.200000 | 1.400000 | 2.500000 | 10.200000 |
| 50% | 15894.00000 | 1998.000000 | 2733.000000 | 1946.000000 | 26.000000 | 4.000000 | 53.000000 | 7.000000 | 986.000000 | 132.000000 | ... | 2.900000 | 1.300000 | 4.200000 | 5.700000 | 2.300000 | 1.000000 | 0.500000 | 1.900000 | 3.300000 | 13.100000 |
| 75% | 23840.50000 | 2012.000000 | 3718.000000 | 1951.000000 | 29.000000 | 7.000000 | 74.000000 | 42.000000 | 1882.000000 | 286.000000 | ... | 4.200000 | 2.600000 | 5.800000 | 8.400000 | 4.000000 | 1.400000 | 1.000000 | 2.500000 | 4.300000 | 16.200000 |
| max | 31787.00000 | 2024.000000 | 5197.000000 | 2002.000000 | 46.000000 | 22.000000 | 90.000000 | 83.000000 | 3882.000000 | 1597.000000 | ... | 36.000000 | 36.000000 | 72.000000 | 72.000000 | 36.000000 | 18.000000 | 36.000000 | 36.000000 | 72.000000 | 108.000000 |
8 rows × 126 columns
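The describe() output above also hints at a quirk of rate stats: per-36-minute values blow up for players with very few minutes (the 108 points-per-36 maximum almost certainly comes from a tiny sample). A common precaution, sketched here on invented toy data, is to filter to a minutes threshold before analyzing per-36 columns; the threshold itself is a judgment call:

```python
import pandas as pd

toy = pd.DataFrame({
    'PLAYER': ['garbage-time player', 'rotation player'],
    'MINUTES_PLAYED': [3.0, 1500.0],
    'POINTS_PER_36_MIN': [108.0, 14.2],
})

# Keep only player-seasons with a meaningful minutes sample (500 is arbitrary)
reliable = toy[toy['MINUTES_PLAYED'] >= 500]
print(reliable['POINTS_PER_36_MIN'].max())  # 14.2
```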
player_df.columns.tolist()
['SEASON_ID', 'SEASON', 'PLAYER_ID', 'PLAYER', 'BIRTH_YEAR', 'POSITION', 'AGE', 'EXPERIENCE', 'LEAGUE', 'TEAM_ABBREVIATION', 'GAMES', 'GAMES_STARTED', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE', 'FIELD_GOALS_AGAINST', 'FIELD_GOAL_PERCENTAGE_x', 'THREE_POINTERS_MADE', 'THREE_POINTERS_ATTEMPTED', 'THREE_POINTERS_PERCENTAGE', 'TWO_POINTERS_MADE', 'TWO_POINTERS_ATTEMPTED', 'TWO_POINTERS_PERCENTAGE', 'EFFECTIVE_FIELD_GOAL_PERCENTAGE', 'FREE_THROWS_MADE', 'FREE_THROW_ATTEMPTED', 'FREE_THROW_PERCENTAGE', 'OFFENSIVE_REBOUND', 'DEFENSIVE_REBOUND', 'TOTAL_REBOUND', 'ASSISTS', 'STEALS', 'BLOCK', 'TURNOVER', 'PERSONAL_FOUL', 'TOTAL_POINTS', 'YEARS_OF_EXPERIENCE', 'GAMES_PLAYED', 'FIELD_GOAL_PERCENT', 'AVERAGE_DISTANCE_OF_FIELD_GOALS_ATTEMPTED', 'PERCENTAGE_FIELD_GOAL_ATTEMPTED_FROM_TWO_POINT_RANGE', 'PERCENTAGE_FIELD_GOAL_ATTEMPTED_FROM_THREE_POINT_RANGE', 'FIELD_GOAL_PERCENTAGE_FROM_TWO_POINT_RANGE', 'FIELD_GOAL_PERCENTAGE_FROM_THREE_POINT_RANGE', 'PERCENT_ASSISTED_TWO_POINT_FIELD_GOAL', 'PERCENT_ASSISTED_THREE_POINT_FIELD_GOAL', 'PERCENTAGE_DUNKS_OF_FIELD_GOAL_ATTEMPTED', 'NUMBER_OF_DUNKS', 'PERCENTAGE_CORNER_THREE_OF_THREE_POINTER_ATTEMPTED', 'CORNER_THREE_POINT_PERCENT', 'NUMBER_HEAVES_ATTEMPTED', 'NUMBER_HEAVES_MADE', 'PLAYED_ID', 'TEAM', 'POINTS_PER_GAME_PERCENT', 'SHOOTING_GUARD_PERCENT', 'SMALL_FORWARD_PERCENT', 'POWER_FORWARD_PERCENT', 'CENTER_PERCENT', 'ON_COURT_PLUS_MINUES_PER_100_POSSESION', 'NET_PLUS_MINUES_PER_100_POSSESION', 'BAD_PASS_TURNOVER', 'LOST_BALL_TURNOVER', 'SHOOTING_FOUL_COMMITTED', 'OFFENSIVE_FOUL_COMMITTED', 'SHOOTING_FOUL_DRAWN', 'OFFENSIVE_FOUL_DRAWN', 'POINTS_GENERATED_BY_ASSISTS', 'AND1S', 'FIELD_GOAL_ATTEMPTED_BLOCKED', 'gs', 'MINUTES_PLAYED_PER_GAME', 'FIELD_GOAL_PER_GAME', 'FIELD_GOAL_ATTEMPTED_PER_GAME', 'THREE_POINTERS_PER_GAME', 'THREE_POINT_ATTEMPTED_PER_GAME', 'THREE_POINT_PERCENTAGE', 'TWO_POINTER_PER_GAME', 'TWO_POINTER_ATTEMPTED_PER_GAME', 'TWO_POINT_PERCENTAGE', 'e_fg_percent', 'FREE_THROW_PER_GAME', 'FREE_THROW_ATTEMPTED_PER_GAME', 
'OFFENSIVE_REBOUND_PER_GAME', 'DEFENSIVE_REBOUND_PER_GAME', 'TOTAL_REBOUND_PER_GAME', 'ASSIST_PER_GAME', 'STEAL_PER_GAME', 'BLOCKS_PER_GAME', 'TURNOVERS_PER_GAME', 'PERSONAL_FOUL_PER_GAME', 'POINTS_PER_GAME', 'POSITINON', 'FIELD_GOALS_PER_100_POSSESION', 'FIELD_GOALS_ATTEMPTED_PER_100_POSSESION', 'FIELD_GOALS_PERCENTAGE', 'THREE_POINTERS_PER_100_POSSESION', 'THREE_POINTERS_ATTEMPTED_PER_100_POSSESION', 'TWO_POINTERS_PER_100_POSSESION', 'TWO_POINTERS_ATTEMPTED_PER_100_POSSESION', 'FREE_THROWS_PER_100_POSSESION', 'FREE_THROWS_ATTEMPTED_PER_100_POSSESION', 'OFFENSIVE_REBOUND_PER_100_POSSESION', 'DEFENSIVE_REBOUND_PER_100_POSSESION', 'TOTAL_REBOUNDS_PER_100_POSSESION', 'ASSIST_PER_100_POSSESION', 'STEALS_PER_100_POSSESION', 'BLOCKS_PER_100_POSSESION', 'TURNOVER_PER_100_POSSESION', 'PERSONAL_FOULS_PER_100_POSSESION', 'POINTS_PER_100_POSSESION', 'OFFESNIVE_Rating', 'DEFENSIVE RATING', 'SEAONS', 'PLAYR_ID', 'PLAYER_NAME', 'FIELD_GOALS_PER_36_MINUTES', 'FIELD_GOALS_ATTEMPTED_PER_GAME', 'FIELD_GOAL_PERCENTAGE_y', 'THREE_POINTERS_PER_36_MIN', 'THREE_POINTERS_ATTEMPTED_PER_36_MIN', 'TWO_POINTERS_PER_36_MIN', 'TWO_POINTERS_ATTEMPTED_PER_36_MIN', 'FREETHROW_PER_36_MIN', 'FREETHROW_ATTEMPTED_PER_36_MIN', 'OFFENSIVE_REBOUND_PER_36_MIN', 'DEFENSIVE_REBOUND_PER_36_MIN', 'TOTAL_REBOUND_PER_36_MIN', 'ASSIST_PER_36_MIN', 'STEALS_PER_36_MIN', 'BLOCKS_PER_36_MIN', 'TURNOVER_PER_36_MIN', 'PERSONAL_FOULS_PER_36_MIN', 'POINTS_PER_36_MIN']
This function identifies the columns that appear in more than one of our source dataframes; these shared columns serve as merge keys and help us spot duplicated columns in the unified player dataframe.
def common_columns(dataframes):
    # Create a dictionary to count the occurrences of each column
    column_counts = {}
    # Iterate over all dataframes
    for df in dataframes:
        # Iterate over all columns in the current dataframe
        for col in df.columns:
            # If the column is already in the dictionary, increment its count
            if col in column_counts:
                column_counts[col] += 1
            # Otherwise, add the column to the dictionary with a count of 1
            else:
                column_counts[col] = 1
    # Return a list of columns that appear at least twice
    return [col for col, count in column_counts.items() if count >= 2]
dfs = [df_totals,
       df_shooting,
       df_play_by_play,
       df_per_game,
       df_per_100_poss,
       df_per_36_minutes]
# dfs is our list of player dataframes
common_cols = common_columns(dfs)
print(common_cols)
['SEASON_ID', 'SEASON', 'PLAYER_ID', 'PLAYER', 'POSITION', 'AGE', 'EXPERIENCE', 'LEAGUE', 'TEAM_ABBREVIATION', 'GAMES', 'GAMES_STARTED', 'MINUTES_PLAYED', 'FIELD_GOAL_PERCENTAGE', 'THREE_POINTERS_PERCENTAGE', 'TWO_POINTERS_PERCENTAGE', 'FREE_THROW_PERCENTAGE', 'YEARS_OF_EXPERIENCE', 'FIELD_GOAL_PERCENT', 'TEAM']
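As an aside, the same helper can be written more concisely with `collections.Counter`; this version is equivalent to the loop above:

```python
from collections import Counter
import pandas as pd

def common_columns(dataframes):
    # Count how many times each column name appears across the dataframes
    counts = Counter(col for df in dataframes for col in df.columns)
    # Keep the names that appear at least twice
    return [col for col, n in counts.items() if n >= 2]

# Tiny demo with empty frames that only carry column names
a = pd.DataFrame(columns=['SEASON', 'PLAYER', 'POINTS'])
b = pd.DataFrame(columns=['SEASON', 'PLAYER', 'REBOUNDS'])
print(common_columns([a, b]))  # ['SEASON', 'PLAYER']
```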
player_df = player_df.drop([
'PLAYER_NAME','POSITINON','PLAYED_ID','GAMES_PLAYED', 'SEASON_ID',
'BIRTH_YEAR',
'POSITION',
'LEAGUE',
'THREE_POINTERS_ATTEMPTED',
'TWO_POINTERS_ATTEMPTED',
'YEARS_OF_EXPERIENCE',
'FIELD_GOAL_PERCENT',
'AVERAGE_DISTANCE_OF_FIELD_GOALS_ATTEMPTED',
'PERCENTAGE_FIELD_GOAL_ATTEMPTED_FROM_TWO_POINT_RANGE',
'PERCENTAGE_FIELD_GOAL_ATTEMPTED_FROM_THREE_POINT_RANGE',
'FIELD_GOAL_PERCENTAGE_FROM_TWO_POINT_RANGE',
'FIELD_GOAL_PERCENTAGE_FROM_THREE_POINT_RANGE',
'PERCENT_ASSISTED_TWO_POINT_FIELD_GOAL',
'PERCENT_ASSISTED_THREE_POINT_FIELD_GOAL',
'PERCENTAGE_DUNKS_OF_FIELD_GOAL_ATTEMPTED',
'NUMBER_OF_DUNKS',
'PERCENTAGE_CORNER_THREE_OF_THREE_POINTER_ATTEMPTED',
'CORNER_THREE_POINT_PERCENT',
'NUMBER_HEAVES_ATTEMPTED',
'NUMBER_HEAVES_MADE',
'POINTS_PER_GAME_PERCENT',
'SHOOTING_GUARD_PERCENT',
'SMALL_FORWARD_PERCENT',
'POWER_FORWARD_PERCENT',
'CENTER_PERCENT',
'ON_COURT_PLUS_MINUES_PER_100_POSSESION',
'NET_PLUS_MINUES_PER_100_POSSESION',
'BAD_PASS_TURNOVER',
'LOST_BALL_TURNOVER',
'SHOOTING_FOUL_COMMITTED',
'OFFENSIVE_FOUL_COMMITTED',
'SHOOTING_FOUL_DRAWN',
'OFFENSIVE_FOUL_DRAWN',
'POINTS_GENERATED_BY_ASSISTS',
'AND1S',
'FIELD_GOAL_ATTEMPTED_BLOCKED',
'gs',
'MINUTES_PLAYED_PER_GAME',
'FIELD_GOAL_PER_GAME',
'FIELD_GOAL_ATTEMPTED_PER_GAME',
'THREE_POINTERS_PER_GAME',
'THREE_POINT_ATTEMPTED_PER_GAME',
'THREE_POINT_PERCENTAGE',
'TWO_POINTER_PER_GAME',
'TWO_POINTER_ATTEMPTED_PER_GAME',
'TWO_POINT_PERCENTAGE',
'e_fg_percent',
'FREE_THROW_PER_GAME',
'FREE_THROW_ATTEMPTED_PER_GAME',
'OFFENSIVE_REBOUND_PER_GAME',
'DEFENSIVE_REBOUND_PER_GAME',
'TOTAL_REBOUND_PER_GAME',
'ASSIST_PER_GAME',
'STEAL_PER_GAME',
'BLOCKS_PER_GAME',
'TURNOVERS_PER_GAME',
'PERSONAL_FOUL_PER_GAME',
'POINTS_PER_GAME',
'FIELD_GOALS_PER_100_POSSESION',
'FIELD_GOALS_ATTEMPTED_PER_100_POSSESION',
'FIELD_GOALS_PERCENTAGE',
'THREE_POINTERS_PER_100_POSSESION',
'THREE_POINTERS_ATTEMPTED_PER_100_POSSESION',
'TWO_POINTERS_PER_100_POSSESION',
'TWO_POINTERS_ATTEMPTED_PER_100_POSSESION',
'FREE_THROWS_PER_100_POSSESION',
'FREE_THROWS_ATTEMPTED_PER_100_POSSESION',
'OFFENSIVE_REBOUND_PER_100_POSSESION',
'DEFENSIVE_REBOUND_PER_100_POSSESION',
'TOTAL_REBOUNDS_PER_100_POSSESION',
'ASSIST_PER_100_POSSESION',
'STEALS_PER_100_POSSESION',
'BLOCKS_PER_100_POSSESION',
'TURNOVER_PER_100_POSSESION',
'PERSONAL_FOULS_PER_100_POSSESION',
'POINTS_PER_100_POSSESION',
'SEAONS',
'PLAYR_ID',
'FIELD_GOALS_PER_36_MINUTES',
'FIELD_GOALS_ATTEMPTED_PER_GAME',
'FIELD_GOAL_PERCENTAGE_y',
'THREE_POINTERS_PER_36_MIN',
'THREE_POINTERS_ATTEMPTED_PER_36_MIN',
'TWO_POINTERS_PER_36_MIN',
'TWO_POINTERS_ATTEMPTED_PER_36_MIN',
'FREETHROW_PER_36_MIN',
'FREETHROW_ATTEMPTED_PER_36_MIN',
'OFFENSIVE_REBOUND_PER_36_MIN',
'DEFENSIVE_REBOUND_PER_36_MIN',
'TOTAL_REBOUND_PER_36_MIN',
'ASSIST_PER_36_MIN',
'STEALS_PER_36_MIN',
'BLOCKS_PER_36_MIN',
'TURNOVER_PER_36_MIN',
'PERSONAL_FOULS_PER_36_MIN',
'POINTS_PER_36_MIN'], axis =1)
player_df.describe()
| SEASON | PLAYER_ID | AGE | EXPERIENCE | GAMES | GAMES_STARTED | MINUTES_PLAYED | FIELD_GOALS_MADE | FIELD_GOALS_AGAINST | FIELD_GOAL_PERCENTAGE_x | ... | DEFENSIVE_REBOUND | TOTAL_REBOUND | ASSISTS | STEALS | BLOCK | TURNOVER | PERSONAL_FOUL | TOTAL_POINTS | OFFESNIVE_Rating | DEFENSIVE RATING | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 31787.000000 | 31787.000000 | 31765.000000 | 31787.000000 | 31787.000000 | 23150.000000 | 30704.000000 | 31787.000000 | 31787.000000 | 31628.000000 | ... | 27130.000000 | 30893.000000 | 31787.000000 | 26161.000000 | 26162.000000 | 26152.000000 | 31787.000000 | 31787.000000 | 16635.000000 | 16697.000000 |
| mean | 1994.973008 | 2646.740114 | 26.486164 | 4.967691 | 48.607796 | 22.406177 | 1158.959386 | 187.590965 | 414.794035 | 0.429751 | ... | 144.706377 | 216.707345 | 109.247711 | 37.799014 | 23.282547 | 70.743691 | 109.857961 | 492.135433 | 103.039615 | 107.991316 |
| std | 20.403059 | 1301.138515 | 3.844564 | 3.723561 | 26.616961 | 27.814887 | 926.420014 | 184.565391 | 391.133337 | 0.101668 | ... | 146.348383 | 224.381109 | 130.954467 | 37.430130 | 35.281323 | 67.437942 | 83.664565 | 485.470475 | 18.632958 | 5.670590 |
| min | 1947.000000 | 1.000000 | 18.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 33.000000 |
| 25% | 1979.000000 | 1625.000000 | 24.000000 | 2.000000 | 25.000000 | 0.000000 | 308.000000 | 37.000000 | 90.000000 | 0.390000 | ... | 32.000000 | 48.000000 | 17.000000 | 8.000000 | 3.000000 | 16.000000 | 34.000000 | 98.000000 | 97.000000 | 104.000000 |
| 50% | 1998.000000 | 2733.000000 | 26.000000 | 4.000000 | 53.000000 | 7.000000 | 986.000000 | 132.000000 | 302.000000 | 0.438000 | ... | 103.000000 | 152.000000 | 64.000000 | 27.000000 | 11.000000 | 51.000000 | 98.000000 | 343.000000 | 105.000000 | 108.000000 |
| 75% | 2012.000000 | 3718.000000 | 29.000000 | 7.000000 | 74.000000 | 42.000000 | 1882.000000 | 286.000000 | 637.000000 | 0.481000 | ... | 209.000000 | 308.000000 | 152.000000 | 56.000000 | 28.000000 | 107.000000 | 173.000000 | 750.000000 | 112.000000 | 112.000000 |
| max | 2024.000000 | 5197.000000 | 46.000000 | 22.000000 | 90.000000 | 83.000000 | 3882.000000 | 1597.000000 | 3159.000000 | 1.000000 | ... | 1111.000000 | 2149.000000 | 1164.000000 | 346.000000 | 456.000000 | 464.000000 | 386.000000 | 4029.000000 | 300.000000 | 130.000000 |
8 rows × 29 columns
player_df.shape
(31787, 32)
player_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31787 entries, 0 to 31786
Data columns (total 32 columns):
 #   Column                           Non-Null Count  Dtype
---  ------                           --------------  -----
 0   SEASON                           31787 non-null  int64
 1   PLAYER_ID                        31787 non-null  int64
 2   PLAYER                           31787 non-null  object
 3   AGE                              31765 non-null  float64
 4   EXPERIENCE                       31787 non-null  int64
 5   TEAM_ABBREVIATION                31787 non-null  object
 6   GAMES                            31787 non-null  int64
 7   GAMES_STARTED                    23150 non-null  float64
 8   MINUTES_PLAYED                   30704 non-null  float64
 9   FIELD_GOALS_MADE                 31787 non-null  int64
 10  FIELD_GOALS_AGAINST              31787 non-null  int64
 11  FIELD_GOAL_PERCENTAGE_x          31628 non-null  float64
 12  THREE_POINTERS_MADE              25435 non-null  float64
 13  THREE_POINTERS_PERCENTAGE        21250 non-null  float64
 14  TWO_POINTERS_MADE                31787 non-null  int64
 15  TWO_POINTERS_PERCENTAGE          31540 non-null  float64
 16  EFFECTIVE_FIELD_GOAL_PERCENTAGE  31628 non-null  float64
 17  FREE_THROWS_MADE                 31787 non-null  int64
 18  FREE_THROW_ATTEMPTED             31787 non-null  int64
 19  FREE_THROW_PERCENTAGE            30488 non-null  float64
 20  OFFENSIVE_REBOUND                27130 non-null  float64
 21  DEFENSIVE_REBOUND                27130 non-null  float64
 22  TOTAL_REBOUND                    30893 non-null  float64
 23  ASSISTS                          31787 non-null  int64
 24  STEALS                           26161 non-null  float64
 25  BLOCK                            26162 non-null  float64
 26  TURNOVER                         26152 non-null  float64
 27  PERSONAL_FOUL                    31787 non-null  int64
 28  TOTAL_POINTS                     31787 non-null  int64
 29  TEAM                             16702 non-null  object
 30  OFFESNIVE_Rating                 16635 non-null  float64
 31  DEFENSIVE RATING                 16697 non-null  float64
dtypes: float64(17), int64(12), object(3)
memory usage: 7.8+ MB
player_df.head()
| SEASON | PLAYER_ID | PLAYER | AGE | EXPERIENCE | TEAM_ABBREVIATION | GAMES | GAMES_STARTED | MINUTES_PLAYED | FIELD_GOALS_MADE | ... | TOTAL_REBOUND | ASSISTS | STEALS | BLOCK | TURNOVER | PERSONAL_FOUL | TOTAL_POINTS | TEAM | OFFESNIVE_Rating | DEFENSIVE RATING | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024 | 5025 | A.J. Green | 24.0 | 2 | MIL | 36 | 0.0 | 335.0 | 52 | ... | 38.0 | 21 | 3.0 | 2.0 | 4.0 | 34 | 156 | MIL | 128.0 | 121.0 |
| 1 | 2024 | 5026 | A.J. Lawson | 23.0 | 2 | DAL | 27 | 0.0 | 230.0 | 39 | ... | 32.0 | 13 | 9.0 | 3.0 | 10.0 | 19 | 103 | DAL | 110.0 | 117.0 |
| 2 | 2024 | 5027 | AJ Griffin | 20.0 | 2 | ATL | 18 | 0.0 | 132.0 | 13 | ... | 14.0 | 4 | 1.0 | 1.0 | 6.0 | 6 | 37 | ATL | 82.0 | 126.0 |
| 3 | 2024 | 4219 | Aaron Gordon | 28.0 | 10 | DEN | 49 | 49.0 | 1555.0 | 266 | ... | 327.0 | 150 | 37.0 | 34.0 | 68.0 | 90 | 677 | DEN | 123.0 | 116.0 |
| 4 | 2024 | 4582 | Aaron Holiday | 27.0 | 6 | HOU | 51 | 1.0 | 913.0 | 134 | ... | 95.0 | 96 | 28.0 | 4.0 | 40.0 | 83 | 370 | HOU | 118.0 | 115.0 |
5 rows × 32 columns
player_df.isnull().sum().tolist()
[0, 0, 0, 22, 0, 0, 0, 8637, 1083, 0, 0, 159, 6352, 10537, 0, 247, 159, 0, 0, 1299, 4657, 4657, 894, 0, 5626, 5625, 5635, 0, 0, 15085, 15152, 15090]
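A bare list of null counts is hard to read without the column names next to it. One alternative, sketched on a toy frame (the same idea applies directly to player_df), is to keep the result as a Series and show only the columns that actually have missing values:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'AGE': [24.0, np.nan, 30.0],
    'TEAM': ['MIL', 'DAL', None],
    'GAMES': [36, 27, 18],
})

null_counts = toy.isnull().sum()        # Series indexed by column name
missing = null_counts[null_counts > 0]  # keep only columns with missing data
print(missing.to_dict())  # {'AGE': 1, 'TEAM': 1}
```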
player_df['SEASON'].nunique()
78
player_df['PLAYER_ID'].nunique()
5197
player_df = player_df.rename(columns={'FIELD_GOAL_PERCENTAGE_x': 'FIELD_GOAL_PERCENTAGE'})
player_df = player_df.rename(columns={'FIELD_GOALS_AGAINST': 'FIELD_GOAL_ATTEMPTED'})
player_df = player_df.rename(columns={'OFFESNIVE_Rating': 'OFFENSIVE_RATING'})
player_df.describe()
| SEASON | PLAYER_ID | AGE | EXPERIENCE | GAMES | GAMES_STARTED | MINUTES_PLAYED | FIELD_GOALS_MADE | FIELD_GOAL_ATTEMPTED | FIELD_GOAL_PERCENTAGE | ... | DEFENSIVE_REBOUND | TOTAL_REBOUND | ASSISTS | STEALS | BLOCK | TURNOVER | PERSONAL_FOUL | TOTAL_POINTS | OFFENSIVE_RATING | DEFENSIVE RATING | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 31787.000000 | 31787.000000 | 31765.000000 | 31787.000000 | 31787.000000 | 23150.000000 | 30704.000000 | 31787.000000 | 31787.000000 | 31628.000000 | ... | 27130.000000 | 30893.000000 | 31787.000000 | 26161.000000 | 26162.000000 | 26152.000000 | 31787.000000 | 31787.000000 | 16635.000000 | 16697.000000 |
| mean | 1994.973008 | 2646.740114 | 26.486164 | 4.967691 | 48.607796 | 22.406177 | 1158.959386 | 187.590965 | 414.794035 | 0.429751 | ... | 144.706377 | 216.707345 | 109.247711 | 37.799014 | 23.282547 | 70.743691 | 109.857961 | 492.135433 | 103.039615 | 107.991316 |
| std | 20.403059 | 1301.138515 | 3.844564 | 3.723561 | 26.616961 | 27.814887 | 926.420014 | 184.565391 | 391.133337 | 0.101668 | ... | 146.348383 | 224.381109 | 130.954467 | 37.430130 | 35.281323 | 67.437942 | 83.664565 | 485.470475 | 18.632958 | 5.670590 |
| min | 1947.000000 | 1.000000 | 18.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 33.000000 |
| 25% | 1979.000000 | 1625.000000 | 24.000000 | 2.000000 | 25.000000 | 0.000000 | 308.000000 | 37.000000 | 90.000000 | 0.390000 | ... | 32.000000 | 48.000000 | 17.000000 | 8.000000 | 3.000000 | 16.000000 | 34.000000 | 98.000000 | 97.000000 | 104.000000 |
| 50% | 1998.000000 | 2733.000000 | 26.000000 | 4.000000 | 53.000000 | 7.000000 | 986.000000 | 132.000000 | 302.000000 | 0.438000 | ... | 103.000000 | 152.000000 | 64.000000 | 27.000000 | 11.000000 | 51.000000 | 98.000000 | 343.000000 | 105.000000 | 108.000000 |
| 75% | 2012.000000 | 3718.000000 | 29.000000 | 7.000000 | 74.000000 | 42.000000 | 1882.000000 | 286.000000 | 637.000000 | 0.481000 | ... | 209.000000 | 308.000000 | 152.000000 | 56.000000 | 28.000000 | 107.000000 | 173.000000 | 750.000000 | 112.000000 | 112.000000 |
| max | 2024.000000 | 5197.000000 | 46.000000 | 22.000000 | 90.000000 | 83.000000 | 3882.000000 | 1597.000000 | 3159.000000 | 1.000000 | ... | 1111.000000 | 2149.000000 | 1164.000000 | 346.000000 | 456.000000 | 464.000000 | 386.000000 | 4029.000000 | 300.000000 | 130.000000 |
8 rows × 29 columns
Given that some columns had missing values, we decided to fill them with the mean. Using the mean of the whole dataframe would not be appropriate, since that would assign players the average of players from entirely different teams. So we instead used the mean of the other players on the same team in the same season.
cols_to_fill = ['THREE_POINTERS_MADE','THREE_POINTERS_PERCENTAGE','STEALS','BLOCK','TURNOVER','DEFENSIVE_REBOUND','OFFENSIVE_REBOUND']
for col in cols_to_fill:
    player_df[col] = player_df.groupby(['TEAM_ABBREVIATION', 'SEASON'])[col].transform(lambda x: x.fillna(x.mean()))
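One caveat with group-wise imputation, sketched below on invented toy data: if every value in a (team, season) group is missing, the group mean is itself NaN and those cells stay empty, so a global fallback (e.g. a league-wide season mean) may still be needed afterwards:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'TEAM_ABBREVIATION': ['MIL', 'MIL', 'SEA', 'SEA'],
    'SEASON': [2024, 2024, 1995, 1995],
    'STEALS': [3.0, np.nan, np.nan, np.nan],
})

# Fill each missing value with its (team, season) group mean
toy['STEALS'] = toy.groupby(['TEAM_ABBREVIATION', 'SEASON'])['STEALS'] \
                   .transform(lambda x: x.fillna(x.mean()))

# The MIL gap is filled, but the all-NaN SEA group stays NaN
print(toy['STEALS'].tolist())  # [3.0, 3.0, nan, nan]
```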
player_df['AGE'].hist()
<Axes: >
How many teams do we even have? (Note: 'TOT' in the output below is not a real team; Basketball-Reference uses it for the combined season totals of players who played for multiple teams.)
player_df['TEAM'].unique().tolist()
['MIL', 'DAL', 'ATL', 'DEN', 'HOU', 'IND', 'OKC', 'CHI', 'ORL', 'BOS', 'TOT', 'DET', 'NYK', 'LAL', 'SAC', 'MIA', 'CHO', 'LAC', 'GSW', 'POR', 'MIN', 'WAS', 'BRK', 'MEM', 'SAS', 'PHO', 'NOP', 'UTA', 'TOR', 'PHI', 'CLE', 'CHA', 'NOH', 'NJN', 'SEA', 'NOK', 'CHH', 'VAN', 'WSB', nan]
There's a rumour going round that younger players perform better. We just wanted to test it...
correlation = player_df['AGE'].corr(player_df['FIELD_GOALS_MADE'])
print(correlation)
0.027280815763253866
plt.figure(figsize=(10,8))
sns.heatmap(player_df[['AGE', 'FIELD_GOALS_MADE']].corr(), annot=True, cmap='coolwarm')
plt.show()
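Since we already import scipy.stats, we can also attach a p-value to such a correlation with `pearsonr`, which tells us whether a near-zero coefficient like the 0.027 above is distinguishable from noise. A sketch on synthetic data (generated to be independent, so the true correlation is zero):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
age = rng.normal(26, 4, 500)       # synthetic ages
fgm = rng.normal(190, 180, 500)    # synthetic field goals, independent of age

# r is the Pearson coefficient, p the probability of seeing |r| this
# large if the variables were truly uncorrelated
r, p = stats.pearsonr(age, fgm)
print(round(r, 3), round(p, 3))
```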
df_opponent_stats_100=pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/Opponent Stats Per 100 Poss.csv')
df_opponents_s_per_game=pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/Opponent Stats Per Game.csv')
df_opponent_totals = pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/Opponent Totals.csv')
df_opponent_stats_100.shape
(1402, 28)
df_opponents_s_per_game.shape
(1845, 28)
df_opponent_totals.shape
(1845, 28)
df_opponent_stats_100.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1402 entries, 0 to 1401
Data columns (total 28 columns):
 #   Column                                               Non-Null Count  Dtype
---  ------                                               --------------  -----
 0   SEASON                                               1402 non-null   int64
 1   LEAGUE                                               1402 non-null   object
 2   TEAM_NAME                                            1402 non-null   object
 3   TEAM_ABBREVIATION                                    1402 non-null   object
 4   PLAYOFFS                                             1402 non-null   bool
 5   GAMES                                                1402 non-null   int64
 6   MINUTES_PLAYED                                       1402 non-null   int64
 7   OPPONENT_FIELD_GOALS_PER_100_POSSESION               1402 non-null   float64
 8   OPPONENT_FIELD_GOALS_ATTEMPTED_PER_100_POSSESION     1402 non-null   float64
 9   OPPONENT_FIELD_GOALS_PERCENTAGE                      1402 non-null   float64
 10  OPPONENT_THREE_POINTERS_PER_100_POSSESION            1283 non-null   float64
 11  OPPONENT_THREE_POINTERS_ATTEMPTED_PER_100_POSSESION  1283 non-null   float64
 12  OPPONENT_THREE_POINTERS_PERCENTAGE                   1283 non-null   float64
 13  OPPONENT_TWO_POINTER_PER_100_POSSESION               1402 non-null   float64
 14  OPPONENT_TWO_POINTERS_ATTEMPTED_PER_100_POSSESION    1402 non-null   float64
 15  OPPONENT_TWO_POINTERS_PERCENTAGE                     1402 non-null   float64
 16  OPPONENT_FREE_THROWS_PER_100_POSSESION               1402 non-null   float64
 17  OPPONENT_FREE_THROWS_ATTEMPTED_PER_100_POSSESION     1402 non-null   float64
 18  OPPONENT_FREE_THROWS_PERCENTAGE                      1402 non-null   float64
 19  OPPONENT_OFFENSIVE_REBOUND_PER_100_POSSESION         1402 non-null   float64
 20  OPPONENT_DEFENSIVE_REBOUND_PER_100_POSSESION         1402 non-null   float64
 21  OPPONENT_TOTAL_REBOUND_PER_100_POSSESION             1402 non-null   float64
 22  OPPONENT_ASSIST_PER_100_POSSESION                    1402 non-null   float64
 23  OPPONENT_STEAL_PER_100_POSSESION                     1402 non-null   float64
 24  OPPONENT_STEAL_PER_100_POSSESION.1                   1402 non-null   float64
 25  OPPONENT_TURNOVER_PER_100_POSSESION                  1402 non-null   float64
 26  OPPONENT_PERSONAL_FOUL_PER_100_POSSESION             1402 non-null   float64
 27  OPPONENT_POINTS_PER_100_POSSESION                    1402 non-null   float64
dtypes: bool(1), float64(21), int64(3), object(3)
memory usage: 297.2+ KB
df_opponents_s_per_game.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1845 entries, 0 to 1844
Data columns (total 28 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   SEASON                                      1845 non-null   int64
 1   LEAGUE                                      1845 non-null   object
 2   TEAM                                        1845 non-null   object
 3   TEAM_ABBREAVIATION                          1758 non-null   object
 4   PLAYOFFS                                    1845 non-null   bool
 5   GAMES                                       1844 non-null   float64
 6   MINUTES_PLAYED_PER_GAME                     1655 non-null   float64
 7   OPPONENTS_FIELD_GOALS_PER_GAME              1605 non-null   float64
 8   OPPONENTS_FIELD_GOALS_Attempted_per_game    1605 non-null   float64
 9   OPPONENT_FIELD_GOALS_PERCENTAGE             1605 non-null   float64
 10  OPPONENTS_THREE_POINTERS_PER_GAME           1402 non-null   float64
 11  PPONENTS_THREE_POINTERS_ATTEMPTED_PER_GAME  1402 non-null   float64
 12  OPPONENTS_THREE_POINTERS_PERCENTAGE         1402 non-null   float64
 13  OPPONENTS_TWO_POINTERS_PER_GAME             1605 non-null   float64
 14  OPPONENTS_TWO_POINTERS_ATTEMPTED_PER_GAME   1605 non-null   float64
 15  OPPONENTS_TWO_POINTERS_PERCENTAGE           1605 non-null   float64
 16  OPPONENTS_FREE_THROWS_PER_GAME              1605 non-null   float64
 17  OPPONENTS_FREE_THROWS_ATTEMPTED_PER_GAME    1605 non-null   float64
 18  OPPONENTS_FREE_THROWS_PERCENTAGE            1605 non-null   float64
 19  OPPONENTS_OFFENSIVE_REBOUND_PER_GAME        1482 non-null   float64
 20  OPPONENTS_DEFENSIVE_REBOUND_PER_GAME        1482 non-null   float64
 21  OPPONENTS_TOTAL_REBOUND_PER_GAME            1601 non-null   float64
 22  OPPONENTS_ASSIST_PER_GAME                   1605 non-null   float64
 23  OPPONENTS_STEAL_PER_GAME                    1456 non-null   float64
 24  OPPONENTSBLOCKS_PER_GAME                    1456 non-null   float64
 25  OPPONENTS_TURNOVER_PER_GAME                 1527 non-null   float64
 26  OPPONENTS_PERSONAL_FOUL_PER_GAME            1605 non-null   float64
 27  OPPONENTS_POINTS_PER_GAME                   1844 non-null   float64
dtypes: bool(1), float64(23), int64(1), object(3)
memory usage: 391.1+ KB
df_opponent_totals.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1845 entries, 0 to 1844
Data columns (total 28 columns):
 #   Column                              Non-Null Count  Dtype
---  ------                              --------------  -----
 0   SEASON                              1845 non-null   int64
 1   LEAGUE                              1845 non-null   object
 2   TEAM                                1845 non-null   object
 3   ABBREVIATION                        1758 non-null   object
 4   PLAYOFFS                            1845 non-null   bool
 5   GAMES                               1844 non-null   float64
 6   MINUTES_PLAYED                      1655 non-null   float64
 7   OPPONENT_FIELD_GOALS                1605 non-null   float64
 8   OPPONENT_FIELD_GOALS_ATTEMPTED      1605 non-null   float64
 9   OPPPONENT_FIELD_GOALS_PERCENTAGE    1605 non-null   float64
 10  OPPONENT_THREE_POINTERS             1402 non-null   float64
 11  OPPONENT_THREE_POINTERS_ATTEMPTED   1402 non-null   float64
 12  OPPONENT_THREE_POINTERS_PERCENTAGE  1402 non-null   float64
 13  OPPONENT_TWO_POINTERS               1605 non-null   float64
 14  OPPONENT_TWO_POINTERS_ATTEMPTED     1605 non-null   float64
 15  OPPONENT_TWO_POINTERS_PERCENTAGE    1605 non-null   float64
 16  OPPONENT_FREE_THROWS                1605 non-null   float64
 17  OPPONENT_FREE_THROWS_ATTEMPTED      1605 non-null   float64
 18  OPPONENT_FREE_THROW_PERCENTAGE      1605 non-null   float64
 19  OPPONENTS_OFFENSIVE_REBOUND         1482 non-null   float64
 20  OPPONENTS_DEFENSIVE_REBOUND         1482 non-null   float64
 21  OPPONENTS_TOTAL_REBOUND             1601 non-null   float64
 22  OPPONENTS_ASSIST                    1605 non-null   float64
 23  OPPONENTS_STEALS                    1456 non-null   float64
 24  OPPONENTS_BLOCKS                    1456 non-null   float64
 25  OPPONENTS_TURNOVERS                 1527 non-null   float64
 26  OPPONENTS_PERSONAL_FOUL             1605 non-null   float64
 27  OPPONENTS_POINTS                    1844 non-null   float64
dtypes: bool(1), float64(23), int64(1), object(3)
memory usage: 391.1+ KB
del df_opponent_stats_100['TEAM_NAME']
del df_opponent_totals['ABBREVIATION']
def common_columns(dataframes):
    # Create a dictionary to count the occurrences of each column
    column_counts = {}
    # Iterate over all dataframes
    for df in dataframes:
        # Iterate over all columns in the current dataframe
        for col in df.columns:
            # If the column is already in the dictionary, increment its count
            if col in column_counts:
                column_counts[col] += 1
            # Otherwise, add the column to the dictionary with a count of 1
            else:
                column_counts[col] = 1
    # Return a list of columns that appear at least twice
    return [col for col, count in column_counts.items() if count >= 2]
dfs2 = [df_opponent_stats_100, df_opponents_s_per_game, df_opponent_totals]
# dfs2 is our list of opponent dataframes
common_cols2 = common_columns(dfs2)
print(common_cols2)
['SEASON', 'LEAGUE', 'PLAYOFFS', 'GAMES', 'MINUTES_PLAYED', 'OPPONENT_FIELD_GOALS_PERCENTAGE', 'OPPONENT_THREE_POINTERS_PERCENTAGE', 'OPPONENT_TWO_POINTERS_PERCENTAGE', 'TEAM']
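The counting loop above can also be written more compactly with collections.Counter, which does the same tallying in one pass. A small sketch on toy frames (stand-ins for our opponent dataframes, not the real data):

```python
from collections import Counter

import pandas as pd

# Toy frames standing in for the three opponent dataframes.
df_a = pd.DataFrame(columns=['SEASON', 'TEAM', 'GAMES'])
df_b = pd.DataFrame(columns=['SEASON', 'TEAM', 'POINTS'])
df_c = pd.DataFrame(columns=['SEASON', 'LEAGUE'])

def common_columns_set(dataframes):
    # Count how many frames each column appears in, keep those seen twice or more.
    counts = Counter(col for df in dataframes for col in df.columns)
    return [col for col, n in counts.items() if n >= 2]

print(common_columns_set([df_a, df_b, df_c]))  # ['SEASON', 'TEAM']
```

Counter preserves first-seen order, so the result lists shared columns in the order they first appear, just like the loop version.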
m1 = pd.merge(df_opponent_stats_100, df_opponents_s_per_game, on=['SEASON', 'LEAGUE', 'PLAYOFFS', 'GAMES', 'OPPONENT_FIELD_GOALS_PERCENTAGE'], how='left')
opponent_team_df = pd.merge(m1, df_opponent_totals, on=['SEASON', 'LEAGUE', 'PLAYOFFS', 'GAMES', 'MINUTES_PLAYED', 'OPPONENT_THREE_POINTERS_PERCENTAGE', 'OPPONENT_TWO_POINTERS_PERCENTAGE', 'TEAM'], how='left')
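Merging on only a subset of the shared columns can silently duplicate rows or produce _x/_y suffixed copies of the leftover columns. pandas has built-in checks for this; a sketch on toy data (not our actual frames) showing how they could be used:

```python
import pandas as pd

# Toy stand-ins for two of the opponent frames, keyed by season and team.
left = pd.DataFrame({'SEASON': [2023, 2024], 'TEAM': ['Hawks', 'Hawks'], 'RATING': [110.0, 112.0]})
right = pd.DataFrame({'SEASON': [2023, 2024], 'TEAM': ['Hawks', 'Hawks'], 'POINTS': [118.0, 120.0]})

# indicator=True adds a _merge column showing whether each row matched both sides;
# validate='one_to_one' raises an error if the merge keys are unexpectedly duplicated.
merged = pd.merge(left, right, on=['SEASON', 'TEAM'], how='left',
                  indicator=True, validate='one_to_one')
print(merged['_merge'].value_counts())
```

Checking that every row is 'both' after a left merge is a quick way to confirm no seasons were dropped or duplicated.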
opponent_team_df.describe()
| SEASON | GAMES | MINUTES_PLAYED | OPPONENT_FIELD_GOALS_PER_100_POSSESION | OPPONENT_FIELD_GOALS_ATTEMPTED_PER_100_POSSESION | OPPONENT_FIELD_GOALS_PERCENTAGE | OPPONENT_THREE_POINTERS_PER_100_POSSESION | OPPONENT_THREE_POINTERS_ATTEMPTED_PER_100_POSSESION | OPPONENT_THREE_POINTERS_PERCENTAGE | OPPONENT_TWO_POINTER_PER_100_POSSESION | ... | OPPONENT_FREE_THROW_PERCENTAGE | OPPONENTS_OFFENSIVE_REBOUND | OPPONENTS_DEFENSIVE_REBOUND | OPPONENTS_TOTAL_REBOUND | OPPONENTS_ASSIST | OPPONENTS_STEALS | OPPONENTS_BLOCKS | OPPONENTS_TURNOVERS | OPPONENTS_PERSONAL_FOUL | OPPONENTS_POINTS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1817.000000 | 1817.000000 | 1817.000000 | 1817.000000 | 1817.000000 | 1817.000000 | 1673.000000 | 1673.000000 | 1673.000000 | 1817.000000 | ... | 1406.000000 | 1406.000000 | 1406.000000 | 1406.000000 | 1406.000000 | 1406.000000 | 1406.000000 | 1406.000000 | 1406.000000 | 1406.000000 |
| mean | 2000.765548 | 80.104568 | 19360.209136 | 40.553990 | 87.672262 | 0.462651 | 6.096533 | 17.246742 | 0.338803 | 34.940451 | ... | 0.757070 | 999.862020 | 2463.342817 | 3463.204836 | 1871.770270 | 649.174964 | 401.936700 | 1268.086771 | 1786.110953 | 8281.305121 |
| std | 14.414455 | 6.636741 | 1605.170296 | 1.894964 | 2.328820 | 0.019595 | 3.698137 | 9.985400 | 0.038049 | 4.564443 | ... | 0.017028 | 195.102932 | 277.961134 | 381.363989 | 256.438149 | 103.089005 | 70.476009 | 233.622830 | 264.816083 | 927.387201 |
| min | 1974.000000 | 11.000000 | 2640.000000 | 34.900000 | 78.000000 | 0.402000 | 0.200000 | 1.500000 | 0.115000 | 23.700000 | ... | 0.694000 | 176.000000 | 360.000000 | 536.000000 | 255.000000 | 100.000000 | 53.000000 | 185.000000 | 273.000000 | 1138.000000 |
| 25% | 1989.000000 | 82.000000 | 19755.000000 | 39.200000 | 86.200000 | 0.450000 | 2.800000 | 8.400000 | 0.329000 | 31.300000 | ... | 0.746000 | 868.000000 | 2345.000000 | 3350.000000 | 1736.250000 | 589.250000 | 357.000000 | 1130.000000 | 1653.000000 | 7942.000000 |
| 50% | 2002.000000 | 82.000000 | 19805.000000 | 40.600000 | 87.800000 | 0.462000 | 6.100000 | 17.300000 | 0.349000 | 33.900000 | ... | 0.757000 | 1000.500000 | 2479.500000 | 3497.000000 | 1892.000000 | 648.000000 | 403.000000 | 1246.000000 | 1808.000000 | 8421.500000 |
| 75% | 2013.000000 | 82.000000 | 19855.000000 | 41.800000 | 89.200000 | 0.475000 | 8.200000 | 22.800000 | 0.363000 | 39.300000 | ... | 0.768000 | 1146.000000 | 2631.750000 | 3661.750000 | 2046.750000 | 714.000000 | 446.750000 | 1404.000000 | 1965.750000 | 8821.000000 |
| max | 2024.000000 | 84.000000 | 20460.000000 | 47.000000 | 95.500000 | 0.536000 | 14.700000 | 40.500000 | 0.411000 | 46.300000 | ... | 0.816000 | 1662.000000 | 3154.000000 | 4750.000000 | 2537.000000 | 955.000000 | 654.000000 | 1980.000000 | 2453.000000 | 10723.000000 |
8 rows × 64 columns
opponent_team_df.shape
(1817, 69)
opponent_team_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1817 entries, 0 to 1816 Data columns (total 69 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SEASON 1817 non-null int64 1 LEAGUE 1817 non-null object 2 TEAM_ABBREVIATION 1817 non-null object 3 PLAYOFFS 1817 non-null bool 4 GAMES 1817 non-null int64 5 MINUTES_PLAYED 1817 non-null int64 6 OPPONENT_FIELD_GOALS_PER_100_POSSESION 1817 non-null float64 7 OPPONENT_FIELD_GOALS_ATTEMPTED_PER_100_POSSESION 1817 non-null float64 8 OPPONENT_FIELD_GOALS_PERCENTAGE 1817 non-null float64 9 OPPONENT_THREE_POINTERS_PER_100_POSSESION 1673 non-null float64 10 OPPONENT_THREE_POINTERS_ATTEMPTED_PER_100_POSSESION 1673 non-null float64 11 OPPONENT_THREE_POINTERS_PERCENTAGE 1673 non-null float64 12 OPPONENT_TWO_POINTER_PER_100_POSSESION 1817 non-null float64 13 OPPONENT_TWO_POINTERS_ATTEMPTED_PER_100_POSSESION 1817 non-null float64 14 OPPONENT_TWO_POINTERS_PERCENTAGE 1817 non-null float64 15 OPPONENT_FREE_THROWS_PER_100_POSSESION 1817 non-null float64 16 OPPONENT_FREE_THROWS_ATTEMPTED_PER_100_POSSESION 1817 non-null float64 17 OPPONENT_FREE_THROWS_PERCENTAGE 1817 non-null float64 18 OPPONENT_OFFENSIVE_REBOUND_PER_100_POSSESION 1817 non-null float64 19 OPPONENT_DEFENSIVE_REBOUND_PER_100_POSSESION 1817 non-null float64 20 OPPONENT_TOTAL_REBOUND_PER_100_POSSESION 1817 non-null float64 21 OPPONENT_ASSIST_PER_100_POSSESION 1817 non-null float64 22 OPPONENT_STEAL_PER_100_POSSESION 1817 non-null float64 23 OPPONENT_STEAL_PER_100_POSSESION.1 1817 non-null float64 24 OPPONENT_TURNOVER_PER_100_POSSESION 1817 non-null float64 25 OPPONENT_PERSONAL_FOUL_PER_100_POSSESION 1817 non-null float64 26 OPPONENT_POINTS_PER_100_POSSESION 1817 non-null float64 27 TEAM 1817 non-null object 28 TEAM_ABBREAVIATION 1806 non-null object 29 MINUTES_PLAYED_PER_GAME 1817 non-null float64 30 OPPONENTS_FIELD_GOALS_PER_GAME 1817 non-null float64 31 OPPONENTS_FIELD_GOALS_Attempted_per_game 1817 non-null float64 32 OPPONENTS_THREE_POINTERS_PER_GAME 
1673 non-null float64 33 PPONENTS_THREE_POINTERS_ATTEMPTED_PER_GAME 1673 non-null float64 34 OPPONENTS_THREE_POINTERS_PERCENTAGE 1673 non-null float64 35 OPPONENTS_TWO_POINTERS_PER_GAME 1817 non-null float64 36 OPPONENTS_TWO_POINTERS_ATTEMPTED_PER_GAME 1817 non-null float64 37 OPPONENTS_TWO_POINTERS_PERCENTAGE 1817 non-null float64 38 OPPONENTS_FREE_THROWS_PER_GAME 1817 non-null float64 39 OPPONENTS_FREE_THROWS_ATTEMPTED_PER_GAME 1817 non-null float64 40 OPPONENTS_FREE_THROWS_PERCENTAGE 1817 non-null float64 41 OPPONENTS_OFFENSIVE_REBOUND_PER_GAME 1817 non-null float64 42 OPPONENTS_DEFENSIVE_REBOUND_PER_GAME 1817 non-null float64 43 OPPONENTS_TOTAL_REBOUND_PER_GAME 1817 non-null float64 44 OPPONENTS_ASSIST_PER_GAME 1817 non-null float64 45 OPPONENTS_STEAL_PER_GAME 1817 non-null float64 46 OPPONENTSBLOCKS_PER_GAME 1817 non-null float64 47 OPPONENTS_TURNOVER_PER_GAME 1817 non-null float64 48 OPPONENTS_PERSONAL_FOUL_PER_GAME 1817 non-null float64 49 OPPONENTS_POINTS_PER_GAME 1817 non-null float64 50 OPPONENT_FIELD_GOALS 1406 non-null float64 51 OPPONENT_FIELD_GOALS_ATTEMPTED 1406 non-null float64 52 OPPPONENT_FIELD_GOALS_PERCENTAGE 1406 non-null float64 53 OPPONENT_THREE_POINTERS 1283 non-null float64 54 OPPONENT_THREE_POINTERS_ATTEMPTED 1283 non-null float64 55 OPPONENT_TWO_POINTERS 1406 non-null float64 56 OPPONENT_TWO_POINTERS_ATTEMPTED 1406 non-null float64 57 OPPONENT_FREE_THROWS 1406 non-null float64 58 OPPONENT_FREE_THROWS_ATTEMPTED 1406 non-null float64 59 OPPONENT_FREE_THROW_PERCENTAGE 1406 non-null float64 60 OPPONENTS_OFFENSIVE_REBOUND 1406 non-null float64 61 OPPONENTS_DEFENSIVE_REBOUND 1406 non-null float64 62 OPPONENTS_TOTAL_REBOUND 1406 non-null float64 63 OPPONENTS_ASSIST 1406 non-null float64 64 OPPONENTS_STEALS 1406 non-null float64 65 OPPONENTS_BLOCKS 1406 non-null float64 66 OPPONENTS_TURNOVERS 1406 non-null float64 67 OPPONENTS_PERSONAL_FOUL 1406 non-null float64 68 OPPONENTS_POINTS 1406 non-null float64 dtypes: bool(1), float64(61), int64(3), 
object(4) memory usage: 967.2+ KB
del opponent_team_df['TEAM_ABBREAVIATION']
opponent_team_df.columns.tolist()
['SEASON', 'LEAGUE', 'TEAM_ABBREVIATION', 'PLAYOFFS', 'GAMES', 'MINUTES_PLAYED', 'OPPONENT_FIELD_GOALS_PER_100_POSSESION', 'OPPONENT_FIELD_GOALS_ATTEMPTED_PER_100_POSSESION', 'OPPONENT_FIELD_GOALS_PERCENTAGE', 'OPPONENT_THREE_POINTERS_PER_100_POSSESION', 'OPPONENT_THREE_POINTERS_ATTEMPTED_PER_100_POSSESION', 'OPPONENT_THREE_POINTERS_PERCENTAGE', 'OPPONENT_TWO_POINTER_PER_100_POSSESION', 'OPPONENT_TWO_POINTERS_ATTEMPTED_PER_100_POSSESION', 'OPPONENT_TWO_POINTERS_PERCENTAGE', 'OPPONENT_FREE_THROWS_PER_100_POSSESION', 'OPPONENT_FREE_THROWS_ATTEMPTED_PER_100_POSSESION', 'OPPONENT_FREE_THROWS_PERCENTAGE', 'OPPONENT_OFFENSIVE_REBOUND_PER_100_POSSESION', 'OPPONENT_DEFENSIVE_REBOUND_PER_100_POSSESION', 'OPPONENT_TOTAL_REBOUND_PER_100_POSSESION', 'OPPONENT_ASSIST_PER_100_POSSESION', 'OPPONENT_STEAL_PER_100_POSSESION', 'OPPONENT_STEAL_PER_100_POSSESION.1', 'OPPONENT_TURNOVER_PER_100_POSSESION', 'OPPONENT_PERSONAL_FOUL_PER_100_POSSESION', 'OPPONENT_POINTS_PER_100_POSSESION', 'TEAM', 'MINUTES_PLAYED_PER_GAME', 'OPPONENTS_FIELD_GOALS_PER_GAME', 'OPPONENTS_FIELD_GOALS_Attempted_per_game', 'OPPONENTS_THREE_POINTERS_PER_GAME', 'PPONENTS_THREE_POINTERS_ATTEMPTED_PER_GAME', 'OPPONENTS_THREE_POINTERS_PERCENTAGE', 'OPPONENTS_TWO_POINTERS_PER_GAME', 'OPPONENTS_TWO_POINTERS_ATTEMPTED_PER_GAME', 'OPPONENTS_TWO_POINTERS_PERCENTAGE', 'OPPONENTS_FREE_THROWS_PER_GAME', 'OPPONENTS_FREE_THROWS_ATTEMPTED_PER_GAME', 'OPPONENTS_FREE_THROWS_PERCENTAGE', 'OPPONENTS_OFFENSIVE_REBOUND_PER_GAME', 'OPPONENTS_DEFENSIVE_REBOUND_PER_GAME', 'OPPONENTS_TOTAL_REBOUND_PER_GAME', 'OPPONENTS_ASSIST_PER_GAME', 'OPPONENTS_STEAL_PER_GAME', 'OPPONENTSBLOCKS_PER_GAME', 'OPPONENTS_TURNOVER_PER_GAME', 'OPPONENTS_PERSONAL_FOUL_PER_GAME', 'OPPONENTS_POINTS_PER_GAME', 'OPPONENT_FIELD_GOALS', 'OPPONENT_FIELD_GOALS_ATTEMPTED', 'OPPPONENT_FIELD_GOALS_PERCENTAGE', 'OPPONENT_THREE_POINTERS', 'OPPONENT_THREE_POINTERS_ATTEMPTED', 'OPPONENT_TWO_POINTERS', 'OPPONENT_TWO_POINTERS_ATTEMPTED', 'OPPONENT_FREE_THROWS', 
'OPPONENT_FREE_THROWS_ATTEMPTED', 'OPPONENT_FREE_THROW_PERCENTAGE', 'OPPONENTS_OFFENSIVE_REBOUND', 'OPPONENTS_DEFENSIVE_REBOUND', 'OPPONENTS_TOTAL_REBOUND', 'OPPONENTS_ASSIST', 'OPPONENTS_STEALS', 'OPPONENTS_BLOCKS', 'OPPONENTS_TURNOVERS', 'OPPONENTS_PERSONAL_FOUL', 'OPPONENTS_POINTS']
opponent_team_df = opponent_team_df.drop([
'LEAGUE',
'OPPONENT_FIELD_GOALS_PER_100_POSSESION',
'OPPONENT_FIELD_GOALS_ATTEMPTED_PER_100_POSSESION',
'OPPONENT_FIELD_GOALS_PERCENTAGE',
'OPPONENT_THREE_POINTERS_PER_100_POSSESION',
'OPPONENT_THREE_POINTERS_ATTEMPTED_PER_100_POSSESION',
'OPPONENT_THREE_POINTERS_PERCENTAGE',
'OPPONENT_TWO_POINTER_PER_100_POSSESION',
'OPPONENT_TWO_POINTERS_ATTEMPTED_PER_100_POSSESION',
'OPPONENT_TWO_POINTERS_PERCENTAGE',
'OPPONENT_FREE_THROWS_PER_100_POSSESION',
'OPPONENT_FREE_THROWS_ATTEMPTED_PER_100_POSSESION',
'OPPONENT_FREE_THROWS_PERCENTAGE',
'OPPONENT_OFFENSIVE_REBOUND_PER_100_POSSESION',
'OPPONENT_DEFENSIVE_REBOUND_PER_100_POSSESION',
'OPPONENT_TOTAL_REBOUND_PER_100_POSSESION',
'OPPONENT_ASSIST_PER_100_POSSESION',
'OPPONENT_STEAL_PER_100_POSSESION',
'OPPONENT_STEAL_PER_100_POSSESION.1',
'OPPONENT_TURNOVER_PER_100_POSSESION',
'OPPONENT_PERSONAL_FOUL_PER_100_POSSESION',
'OPPONENT_POINTS_PER_100_POSSESION',
'MINUTES_PLAYED_PER_GAME',
'PPONENTS_THREE_POINTERS_ATTEMPTED_PER_GAME',
'OPPONENTS_TWO_POINTERS_ATTEMPTED_PER_GAME',
'OPPONENTS_FREE_THROWS_ATTEMPTED_PER_GAME',
'OPPONENTS_OFFENSIVE_REBOUND_PER_GAME',
'OPPONENTS_DEFENSIVE_REBOUND_PER_GAME',
'OPPONENT_THREE_POINTERS_ATTEMPTED',
'OPPONENT_TWO_POINTERS_ATTEMPTED',
'OPPONENT_FREE_THROW_PERCENTAGE',
], axis=1)
opponent_team_df.fillna(0, inplace=True)
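Filling the remaining gaps with 0 keeps things simple, but zeros are implausible values for most of these box-score stats, so summary statistics get pulled downward. If unbiased summaries mattered, per-column median imputation is a common alternative. A sketch on a toy frame, not a claim about what this project did:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps, standing in for opponent_team_df.
toy = pd.DataFrame({'OPPONENTS_POINTS': [100.0, np.nan, 110.0],
                    'OPPONENTS_ASSISTS': [np.nan, 25.0, 27.0]})

# Fill each numeric column with its own median instead of a flat 0,
# so league-wide averages are not dragged toward zero.
filled = toy.fillna(toy.median(numeric_only=True))
print(filled)
```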
opponent_team_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1817 entries, 0 to 1816 Data columns (total 37 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SEASON 1817 non-null int64 1 TEAM_ABBREVIATION 1817 non-null object 2 PLAYOFFS 1817 non-null bool 3 GAMES 1817 non-null int64 4 MINUTES_PLAYED 1817 non-null int64 5 TEAM 1817 non-null object 6 OPPONENTS_FIELD_GOALS_PER_GAME 1817 non-null float64 7 OPPONENTS_FIELD_GOALS_Attempted_per_game 1817 non-null float64 8 OPPONENTS_THREE_POINTERS_PER_GAME 1817 non-null float64 9 OPPONENTS_THREE_POINTERS_PERCENTAGE 1817 non-null float64 10 OPPONENTS_TWO_POINTERS_PER_GAME 1817 non-null float64 11 OPPONENTS_TWO_POINTERS_PERCENTAGE 1817 non-null float64 12 OPPONENTS_FREE_THROWS_PER_GAME 1817 non-null float64 13 OPPONENTS_FREE_THROWS_PERCENTAGE 1817 non-null float64 14 OPPONENTS_TOTAL_REBOUND_PER_GAME 1817 non-null float64 15 OPPONENTS_ASSIST_PER_GAME 1817 non-null float64 16 OPPONENTS_STEAL_PER_GAME 1817 non-null float64 17 OPPONENTSBLOCKS_PER_GAME 1817 non-null float64 18 OPPONENTS_TURNOVER_PER_GAME 1817 non-null float64 19 OPPONENTS_PERSONAL_FOUL_PER_GAME 1817 non-null float64 20 OPPONENTS_POINTS_PER_GAME 1817 non-null float64 21 OPPONENT_FIELD_GOALS 1817 non-null float64 22 OPPONENT_FIELD_GOALS_ATTEMPTED 1817 non-null float64 23 OPPPONENT_FIELD_GOALS_PERCENTAGE 1817 non-null float64 24 OPPONENT_THREE_POINTERS 1817 non-null float64 25 OPPONENT_TWO_POINTERS 1817 non-null float64 26 OPPONENT_FREE_THROWS 1817 non-null float64 27 OPPONENT_FREE_THROWS_ATTEMPTED 1817 non-null float64 28 OPPONENTS_OFFENSIVE_REBOUND 1817 non-null float64 29 OPPONENTS_DEFENSIVE_REBOUND 1817 non-null float64 30 OPPONENTS_TOTAL_REBOUND 1817 non-null float64 31 OPPONENTS_ASSIST 1817 non-null float64 32 OPPONENTS_STEALS 1817 non-null float64 33 OPPONENTS_BLOCKS 1817 non-null float64 34 OPPONENTS_TURNOVERS 1817 non-null float64 35 OPPONENTS_PERSONAL_FOUL 1817 non-null float64 36 OPPONENTS_POINTS 1817 non-null float64 dtypes: bool(1), 
float64(31), int64(3), object(2) memory usage: 512.9+ KB
opponent_team_df = opponent_team_df.rename(columns={'OPPONENTSBLOCKS_PER_GAME': 'OPPONENTS_BLOCKS_PER_GAME'})
opponent_team_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1817 entries, 0 to 1816 Data columns (total 37 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SEASON 1817 non-null int64 1 TEAM_ABBREVIATION 1817 non-null object 2 PLAYOFFS 1817 non-null bool 3 GAMES 1817 non-null int64 4 MINUTES_PLAYED 1817 non-null int64 5 TEAM 1817 non-null object 6 OPPONENTS_FIELD_GOALS_PER_GAME 1817 non-null float64 7 OPPONENTS_FIELD_GOALS_Attempted_per_game 1817 non-null float64 8 OPPONENTS_THREE_POINTERS_PER_GAME 1817 non-null float64 9 OPPONENTS_THREE_POINTERS_PERCENTAGE 1817 non-null float64 10 OPPONENTS_TWO_POINTERS_PER_GAME 1817 non-null float64 11 OPPONENTS_TWO_POINTERS_PERCENTAGE 1817 non-null float64 12 OPPONENTS_FREE_THROWS_PER_GAME 1817 non-null float64 13 OPPONENTS_FREE_THROWS_PERCENTAGE 1817 non-null float64 14 OPPONENTS_TOTAL_REBOUND_PER_GAME 1817 non-null float64 15 OPPONENTS_ASSIST_PER_GAME 1817 non-null float64 16 OPPONENTS_STEAL_PER_GAME 1817 non-null float64 17 OPPONENTS_BLOCKS_PER_GAME 1817 non-null float64 18 OPPONENTS_TURNOVER_PER_GAME 1817 non-null float64 19 OPPONENTS_PERSONAL_FOUL_PER_GAME 1817 non-null float64 20 OPPONENTS_POINTS_PER_GAME 1817 non-null float64 21 OPPONENT_FIELD_GOALS 1817 non-null float64 22 OPPONENT_FIELD_GOALS_ATTEMPTED 1817 non-null float64 23 OPPPONENT_FIELD_GOALS_PERCENTAGE 1817 non-null float64 24 OPPONENT_THREE_POINTERS 1817 non-null float64 25 OPPONENT_TWO_POINTERS 1817 non-null float64 26 OPPONENT_FREE_THROWS 1817 non-null float64 27 OPPONENT_FREE_THROWS_ATTEMPTED 1817 non-null float64 28 OPPONENTS_OFFENSIVE_REBOUND 1817 non-null float64 29 OPPONENTS_DEFENSIVE_REBOUND 1817 non-null float64 30 OPPONENTS_TOTAL_REBOUND 1817 non-null float64 31 OPPONENTS_ASSIST 1817 non-null float64 32 OPPONENTS_STEALS 1817 non-null float64 33 OPPONENTS_BLOCKS 1817 non-null float64 34 OPPONENTS_TURNOVERS 1817 non-null float64 35 OPPONENTS_PERSONAL_FOUL 1817 non-null float64 36 OPPONENTS_POINTS 1817 non-null float64 dtypes: bool(1), 
float64(31), int64(3), object(2) memory usage: 512.9+ KB
opponent_team_df['SEASON'].unique()
array([2024, 2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014,
2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003,
2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992,
1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981,
1980, 1979, 1978, 1977, 1976, 1975, 1974])
opponent_team_df['TEAM'].unique()
array(['Atlanta Hawks', 'Boston Celtics', 'Brooklyn Nets',
'Chicago Bulls', 'Charlotte Hornets', 'Cleveland Cavaliers',
'Dallas Mavericks', 'Denver Nuggets', 'Detroit Pistons',
'Portland Trail Blazers', 'Golden State Warriors',
'Houston Rockets', 'Indiana Pacers', 'Los Angeles Clippers',
'Los Angeles Lakers', 'Memphis Grizzlies', 'Miami Heat',
'Milwaukee Bucks', 'Minnesota Timberwolves',
'New Orleans Pelicans', 'New York Knicks', 'Oklahoma City Thunder',
'Orlando Magic', 'Philadelphia 76ers', 'Phoenix Suns',
'Sacramento Kings', 'San Antonio Spurs', 'Toronto Raptors',
'Utah Jazz', 'Washington Wizards', 'League Average',
'Charlotte Bobcats', 'New Orleans Hornets', 'New Jersey Nets',
'Seattle SuperSonics', 'New Orleans/Oklahoma City Hornets',
'Vancouver Grizzlies', 'Washington Bullets', 'Kansas City Kings',
'San Diego Clippers', 'New Orleans Jazz', 'Buffalo Braves',
'New York Nets', 'Kentucky Colonels', 'San Diego Sails',
'Spirits of St. Louis', 'Utah Stars', 'Virginia Squires',
'Kansas City-Omaha Kings', 'Memphis Sounds',
'San Diego Conquistadors', 'Capital Bullets', 'Carolina Cougars',
'Denver Rockets', 'Memphis Tams'], dtype=object)
Although this DataFrame didn't significantly influence our final model, we took a moment to investigate whether there was a substantial variance in the number of injured players per team. We hypothesized that this might impact a team's overall performance and, consequently, their likelihood of winning. It's worth noting that the available data only covers a decade (2010-2020).
injuries_df = pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/injuries_2010-2020.csv')
injuries_df.shape
(27105, 5)
injuries_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 27105 entries, 0 to 27104 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 DATE 27105 non-null object 1 TEAM 27103 non-null object 2 ACQUIRED 9542 non-null object 3 RELINQUISHED 17560 non-null object 4 NOTES 27105 non-null object dtypes: object(5) memory usage: 1.0+ MB
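One caveat before counting rows per team: the ACQUIRED column appears to record players returning from the injured list, while RELINQUISHED records players going onto it. If that reading of the columns is right, counting every row mixes returns with injuries. A sketch of restricting to RELINQUISHED rows, on a toy log mirroring this file's layout:

```python
import pandas as pd

# Toy transaction log with the same columns as the injuries file:
# an ACQUIRED row is a return, a RELINQUISHED row is a player lost.
toy = pd.DataFrame({
    'TEAM': ['Bulls', 'Bulls', 'Heat'],
    'ACQUIRED': ['Player A', None, None],
    'RELINQUISHED': [None, 'Player B', 'Player C'],
})

# Keep only rows where a player was actually relinquished (injured).
injury_events = toy[toy['RELINQUISHED'].notna()]
print(injury_events['TEAM'].value_counts())
```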
injuries_df['TEAM'].unique()
array(['Bulls', 'Pistons', 'Blazers', 'Nets', 'Nuggets', 'Bucks', 'Kings',
'Bobcats', 'Warriors', 'Suns', 'Heat', 'Thunder', 'Timberwolves',
'Celtics', 'Lakers', 'Rockets', '76ers', 'Cavaliers', 'Clippers',
'Grizzlies', 'Hawks', 'Hornets', 'Jazz', 'Knicks', 'Mavericks',
'Pacers', 'Raptors', 'Spurs', 'Magic', 'Wizards', 'Pelicans', nan,
'Bullets'], dtype=object)
injuries_df['TEAM'].value_counts()
TEAM Spurs 1163 Bucks 1068 Warriors 1060 Rockets 1058 Raptors 1044 Celtics 1040 Nets 1024 Heat 1023 Cavaliers 1001 Mavericks 992 Hawks 975 Nuggets 966 Lakers 959 Knicks 943 76ers 910 Wizards 875 Grizzlies 875 Timberwolves 860 Jazz 841 Magic 834 Pacers 831 Bulls 791 Suns 733 Kings 728 Hornets 719 Clippers 718 Thunder 717 Pistons 714 Blazers 695 Pelicans 576 Bobcats 369 Bullets 1 Name: count, dtype: int64
pd.crosstab(index=injuries_df['TEAM'], columns="count")
| col_0 | count |
|---|---|
| TEAM | |
| 76ers | 910 |
| Blazers | 695 |
| Bobcats | 369 |
| Bucks | 1068 |
| Bullets | 1 |
| Bulls | 791 |
| Cavaliers | 1001 |
| Celtics | 1040 |
| Clippers | 718 |
| Grizzlies | 875 |
| Hawks | 975 |
| Heat | 1023 |
| Hornets | 719 |
| Jazz | 841 |
| Kings | 728 |
| Knicks | 943 |
| Lakers | 959 |
| Magic | 834 |
| Mavericks | 992 |
| Nets | 1024 |
| Nuggets | 966 |
| Pacers | 831 |
| Pelicans | 576 |
| Pistons | 714 |
| Raptors | 1044 |
| Rockets | 1058 |
| Spurs | 1163 |
| Suns | 733 |
| Thunder | 717 |
| Timberwolves | 860 |
| Warriors | 1060 |
| Wizards | 875 |
A visual representation of the distribution of injured players across teams.
# value_counts().plot(kind='bar') gives one bar per team; Series.hist() expects numeric data
injuries_df['TEAM'].value_counts().plot(kind='bar')
plt.xlabel('Team Name')
plt.ylabel('Number of Injuries')
plt.title('Number of Injuries by Team')
plt.xticks(rotation='vertical')
plt.show()
As we delve into the draft_history dataset with df_draft_history.info(), we get a glimpse of the different chapters of this story. Each column represents a different aspect of the draft history, and info() also reports the number of non-null entries and the data type of each column. df_draft_history.head() then gives us a sneak peek into the data.
df_draft_history = pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/draft_history.csv')
df_draft_combine_stats =pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/draft_combine_stats.csv')
df_draft_history.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7990 entries, 0 to 7989 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PERSON_ID 7990 non-null int64 1 PLAYER_NAME 7990 non-null object 2 SEASON 7990 non-null int64 3 ROUND_NUMBER 7990 non-null int64 4 ROUND_PICK 7990 non-null int64 5 OVERALL_PICK 7990 non-null int64 6 DRAFT_TYPE 7990 non-null object 7 TEAM_ID 7990 non-null int64 8 TEAM_CITY 7990 non-null object 9 TEAM_NAME 7990 non-null object 10 TEAM_ABBREVIATION 7990 non-null object 11 ORGANIZATION 7971 non-null object 12 ORGANIZATION_TYPE 7971 non-null object 13 PLAYER_PROFILE_FLAG 7990 non-null int64 dtypes: int64(7), object(7) memory usage: 874.0+ KB
df_draft_combine_stats.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1202 entries, 0 to 1201 Data columns (total 47 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SEASON 1202 non-null int64 1 PLAYER_ID 1202 non-null int64 2 FIRST_NAME 1202 non-null object 3 LAST_NAME 1202 non-null object 4 PLAYER_NAME 1202 non-null object 5 POSITION 1197 non-null object 6 HEIGHT_WITHOUT_SHOEAS 1153 non-null float64 7 HEIGHT_WITHOUT_SHOES_FEET_INCHES 1153 non-null object 8 HEIGHT_WITH_SHOES 1008 non-null float64 9 HEIGHT_WITH_SHOES_FEET_INCHES 1008 non-null object 10 WEIGHT 1152 non-null float64 11 WINGSPAN 1153 non-null float64 12 WINGSPAN_FEET_INCHES 1153 non-null object 13 STANDING_REACH 1152 non-null float64 14 STANDING_REACH_FT_INCHES 1152 non-null object 15 BODY_FAT_PERCENTAGE 1003 non-null float64 16 HAND_LENGHT 719 non-null float64 17 HAND_WIDTH 719 non-null float64 18 STANDING_VERTICAL_LEAP 1017 non-null float64 19 MAX_VERTICAL_LEAP 1017 non-null float64 20 LANE_AGILITY_TIME 1008 non-null float64 21 MODIFIED_LANE_AGILITY_TIME 411 non-null float64 22 THREE_QUARTER_SPRINT 1012 non-null float64 23 BENCH_PRESSES 808 non-null float64 24 SPOT_FIFTEEN_CORNER_LEFT 74 non-null object 25 SPOT_FIFTEEN_BREAK_LEFT 76 non-null object 26 SPOT_FIFTEEN_TOP_KEY 76 non-null object 27 SPOT_FIFTEEN_BREAK_RIGHT 76 non-null object 28 SPOT_FIFTEEN_CORNER_RIGHT 76 non-null object 29 SPOT_COLLEGE_CORNER_LEFT 245 non-null object 30 SPOT_COLLEGE_BREAK_LEFT 166 non-null object 31 SPOT_COLLEGE_TOP_KEY 166 non-null object 32 SPOT_COLLEGE_BREAK_RIGHT 166 non-null object 33 SPOT_COLLEGE_CORNER_RIGHT 166 non-null object 34 SPOT_NBA_CORNER_LEFT 217 non-null object 35 SPOT_NBA_BREAK_LEFT 217 non-null object 36 SPOT_NBA_TOP_KEY 217 non-null object 37 SPOT_NBA_BREAK_RIGHT 217 non-null object 38 SPOT_NBA_CORNER_RIGHT 217 non-null object 39 OFF_DRIBBLE_FIFTEEN_BREAK_LEFT 166 non-null object 40 OFF_DRIBBLE_FIFTEEN_TOP_KEY 166 non-null object 41 OFF_DRIBBLE_FIFTEEN_BREAK_RIGHT 166 non-null object 42 
OFF_DRIBBLE_COLLEGE_BREAK_LEFT 110 non-null object 43 OFF_DRIBBLE_COLLEGE_TOP-KEY 31 non-null object 44 OFF_DRIBBLE_COLLEGE_BREAK_RIGHT 31 non-null object 45 ON_MOVE_FIFTEEN 148 non-null object 46 ON_MOVE_COLLEGE 116 non-null object dtypes: float64(14), int64(2), object(31) memory usage: 441.5+ KB
df_draft_history.head()
| PERSON_ID | PLAYER_NAME | SEASON | ROUND_NUMBER | ROUND_PICK | OVERALL_PICK | DRAFT_TYPE | TEAM_ID | TEAM_CITY | TEAM_NAME | TEAM_ABBREVIATION | ORGANIZATION | ORGANIZATION_TYPE | PLAYER_PROFILE_FLAG | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 79299 | Clifton McNeeley | 1947 | 1 | 1 | 1 | Draft | 1610610031 | Pittsburgh | Ironmen | PIT | Texas-El Paso | College/University | 0 |
| 1 | 78109 | Glen Selbo | 1947 | 1 | 2 | 2 | Draft | 1610610035 | Toronto | Huskies | HUS | Wisconsin | College/University | 1 |
| 2 | 76649 | Eddie Ehlers | 1947 | 1 | 3 | 3 | Draft | 1610612738 | Boston | Celtics | BOS | Purdue | College/University | 1 |
| 3 | 79302 | Walt Dropo | 1947 | 1 | 4 | 4 | Draft | 1610610032 | Providence | Steamrollers | PRO | Connecticut | College/University | 0 |
| 4 | 77048 | Dick Holub | 1947 | 1 | 5 | 5 | Draft | 1610612752 | New York | Knicks | NYK | Long Island-Brooklyn | College/University | 1 |
df_draft_combine_stats.head()
| SEASON | PLAYER_ID | FIRST_NAME | LAST_NAME | PLAYER_NAME | POSITION | HEIGHT_WITHOUT_SHOEAS | HEIGHT_WITHOUT_SHOES_FEET_INCHES | HEIGHT_WITH_SHOES | HEIGHT_WITH_SHOES_FEET_INCHES | ... | SPOT_NBA_BREAK_RIGHT | SPOT_NBA_CORNER_RIGHT | OFF_DRIBBLE_FIFTEEN_BREAK_LEFT | OFF_DRIBBLE_FIFTEEN_TOP_KEY | OFF_DRIBBLE_FIFTEEN_BREAK_RIGHT | OFF_DRIBBLE_COLLEGE_BREAK_LEFT | OFF_DRIBBLE_COLLEGE_TOP-KEY | OFF_DRIBBLE_COLLEGE_BREAK_RIGHT | ON_MOVE_FIFTEEN | ON_MOVE_COLLEGE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2001 | 12033 | Adam | Allenspach | Adam Allenspach | C | 83.50 | 6' 11.5'' | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2001 | 2240 | Gilbert | Arenas | Gilbert Arenas | SG | 74.25 | 6' 2.25'' | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2001 | 2220 | Brandon | Armstrong | Brandon Armstrong | SG | 75.50 | 6' 3.5'' | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2001 | 2203 | Shane | Battier | Shane Battier | SF-PF | 80.25 | 6' 8.25'' | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2001 | 12034 | Cookie | Belcher | Cookie Belcher | SG-PG | 75.00 | 6' 3'' | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 47 columns
Now, the plot thickens. This line is like a leaderboard, ranking the top 20 organizations based on their frequency in the draft history. It’s a testament to their consistent presence and impact in the drafts.
top_20_organizations = df_draft_history['ORGANIZATION'].value_counts().head(20)
print(top_20_organizations)
ORGANIZATION Kentucky 137 California-Los Angeles 120 North Carolina 115 Duke 105 Kansas 87 Indiana 77 Arizona 74 Louisville 73 Michigan 73 Maryland 72 Illinois 71 Syracuse 69 North Carolina State 66 Notre Dame 63 Southern California 62 Villanova 61 St. John's (NY) 61 Washington 60 Michigan State 60 Marquette 58 Name: count, dtype: int64
But what about the underdogs? df_draft_history['ORGANIZATION'].value_counts().tail(20) gives us insight into the organizations that have had less representation in the drafts. This could be the beginning of their journey to rise through the ranks.
df_draft_history['ORGANIZATION'].value_counts().tail(20)
ORGANIZATION West Alabama 1 Bloomsburg 1 Southern Utah 1 Juvecaserta Basket (Italy) 1 Calgary (CAN) 1 Longwood 1 Eastern Oregon 1 Florida International 1 Massachusetts Boston 1 California-Davis 1 Hellenic 1 Lawrence 1 Alabama A&M 1 Pratt 1 Williams 1 Caltech 1 McMurry 1 York (CAN) 1 Indiana Wesleyan 1 Belhaven 1 Name: count, dtype: int64
The print(df_draft_history['OVERALL_PICK'].value_counts()) command is like a roll call, tallying up the frequency of each overall pick number. It’s a snapshot of the distribution of picks across the drafts.
print(df_draft_history['OVERALL_PICK'].value_counts())
OVERALL_PICK
0 655
1 74
2 74
3 74
4 74
...
232 2
231 2
230 2
238 1
239 1
Name: count, Length: 240, dtype: int64
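One detail stands out: 655 rows share an OVERALL_PICK of 0, which likely encodes missing or undrafted entries (an assumption; the file does not document it). If so, excluding those zeros before computing averages keeps them from pulling the mean down. A minimal sketch on toy picks:

```python
import pandas as pd

# Toy pick numbers including 0, which we assume is a placeholder value.
picks = pd.Series([0, 0, 1, 2, 3], name='OVERALL_PICK')

# Compare the mean with and without the placeholder zeros.
mean_with_zeros = picks.mean()
mean_without = picks[picks > 0].mean()
print(mean_with_zeros, mean_without)
```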
Next, we turn our attention to the top 20 organizations. This line calculates the average overall pick for each of these organizations, shedding light on their typical standing in the drafts.
The bar graph that follows paints a vivid picture of this data. It’s a visual representation of the average overall pick by the top 20 organizations, allowing us to easily compare and contrast their performances.
top_20_orgs = df_draft_history['ORGANIZATION'].value_counts().index[:20]
average_pick = df_draft_history[df_draft_history['ORGANIZATION'].isin(top_20_orgs)].groupby('ORGANIZATION')['OVERALL_PICK'].mean()
plt.figure(figsize=(10,8))
average_pick.plot(kind='bar')
plt.xlabel('Organization')
plt.ylabel('Average Overall Pick')
plt.title('Bar Graph of Average Overall Pick by Top 20 Organizations')
plt.xticks(rotation=90)
plt.show()
Finally, we zoom in on the coveted first overall pick. This line is spotlighting the entries where the overall pick was 1, the pinnacle of achievement in the drafts.
The pie chart at the end is the grand finale. It showcases the top 10 organizations that have had the honor of the first overall pick the most times, with all other organizations grouped into an 'Other' category. It's a tribute to their success and a testament to their dominance in the drafts.
This could lead to further questions and investigations, such as looking into why those organizations have those patterns (e.g., more successful scouting, better player development, etc.).
df_overall_pick_1 = df_draft_history[df_draft_history['OVERALL_PICK'] == 1]
org_counts = df_overall_pick_1['ORGANIZATION'].value_counts()
# Get the top 10 organizations
top_10_orgs_counts = org_counts.nlargest(10)
# Group all other organizations into a single 'Other' category
top_10_orgs_counts['Other'] = org_counts.iloc[10:].sum()
plt.figure(figsize=(10,8)) # Adjust the size as necessary
plt.pie(top_10_orgs_counts, labels=top_10_orgs_counts.index, autopct='%1.1f%%')
plt.title('Pie Chart of Top 10 Organizations with Overall Pick #1 (and Others)')
plt.show()
all_star = pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/All-Star Selections.csv')
all_star.shape
(2000, 5)
all_star.describe()
| SEASON | |
|---|---|
| count | 2000.000000 |
| mean | 1987.016000 |
| std | 20.730307 |
| min | 1951.000000 |
| 25% | 1970.000000 |
| 50% | 1985.000000 |
| 75% | 2005.250000 |
| max | 2024.000000 |
all_star.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2000 entries, 0 to 1999 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PLAYER 2000 non-null object 1 TEAM 2000 non-null object 2 LEAGUE 2000 non-null object 3 SEASON 2000 non-null int64 4 REPLACED 2000 non-null bool dtypes: bool(1), int64(1), object(3) memory usage: 64.6+ KB
all_star.head()
| PLAYER | TEAM | LEAGUE | SEASON | REPLACED | |
|---|---|---|---|---|---|
| 0 | Bam Adebayo | East | NBA | 2024 | False |
| 1 | Giannis Antetokounmpo | East | NBA | 2024 | False |
| 2 | Paolo Banchero | East | NBA | 2024 | False |
| 3 | Scottie Barnes | East | NBA | 2024 | False |
| 4 | Jaylen Brown | East | NBA | 2024 | False |
all_star.tail()
| PLAYER | TEAM | LEAGUE | SEASON | REPLACED | |
|---|---|---|---|---|---|
| 1995 | Alex Groza | West | NBA | 1951 | False |
| 1996 | George Mikan | West | NBA | 1951 | False |
| 1997 | Vern Mikkelsen | West | NBA | 1951 | False |
| 1998 | Jim Pollard | West | NBA | 1951 | False |
| 1999 | Fred Schaus | West | NBA | 1951 | False |
all_star['TEAM'].value_counts()
TEAM East 909 West 907 Team LeBron 67 Team Durant 29 Team Giannis 26 LeBron 14 Giannis 13 Team Stephen 12 All Stars 12 Denver 11 Name: count, dtype: int64
all_star = all_star[all_star['TEAM'] != 'Denver']
all_star['TEAM'].value_counts()
TEAM East 909 West 907 Team LeBron 67 Team Durant 29 Team Giannis 26 LeBron 14 Giannis 13 Team Stephen 12 All Stars 12 Name: count, dtype: int64
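Filtering out 'Denver' with a single comparison works, but the TEAM column still mixes conference labels with captain-draft labels like 'Team LeBron'. If more of these needed removing, isin scales better than chaining comparisons. A sketch on toy labels:

```python
import pandas as pd

# Toy all-star table with both conference labels and other label styles.
toy = pd.DataFrame({'TEAM': ['East', 'West', 'Team LeBron', 'Denver', 'East']})

# Drop several unwanted labels at once with isin, instead of one comparison per label.
keep = toy[~toy['TEAM'].isin(['Denver', 'Team LeBron'])]
print(keep['TEAM'].value_counts())
```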
all_star['PLAYER'].value_counts()
PLAYER
LeBron James 20
Kareem Abdul-Jabbar 19
Kobe Bryant 18
Julius Erving 16
Tim Duncan 15
..
Michael Redd 1
Kenyon Martin 1
Jamaal Magloire 1
Metta World Peace 1
Fred Schaus 1
Name: count, Length: 520, dtype: int64
The bar plot below shows each player's All-Star appearance count, making it easy to compare players at a glance. It reads like a hall of fame, honoring the players selected to the All-Star game the most times, and it invites follow-up questions, such as why those players are selected so frequently (e.g., outstanding performance, popularity, etc.).
player_counts = all_star['PLAYER'].value_counts().sort_values(ascending=False).head(50)
plt.figure(figsize=(10, 5))
plt.bar(player_counts.index, player_counts.values, color='b')
plt.xlabel('Player Name')
plt.ylabel('Number of Occurrences')
plt.title('All-Star Appearances')
plt.xticks(rotation='vertical')
plt.show()
We'll now deal with the Teams dataframes
#Importing all dataframes for the teams that will be needed to create a unified team dataframe
teams_df = pd.read_csv("drive/MyDrive/FINAL PROJECT CSVs/teams.csv") #ONLY HAS BASIC INFO
team_totals_df = pd.read_csv("drive/MyDrive/FINAL PROJECT CSVs/Team Totals.csv")
team_sum_df = pd.read_csv("drive/MyDrive/FINAL PROJECT CSVs/Team Summaries.csv")
team_statPG_df = pd.read_csv("drive/MyDrive/FINAL PROJECT CSVs/Team Stats Per Game.csv")
print(team_sum_df.shape)
(1845, 31)
print(team_sum_df.notnull().sum())
SEASON 1845 LEAGUE 1845 TEAM 1845 ABBREVIATION 1758 PLAYOFFS 1845 AGE 1781 W 1757 L 1757 PW 1844 PL 1844 MARGIN_OF_VICTORY 1844 STRENGTH_OF_SCHEDULE 1844 SIMPLE_RATING_SYSTEM 1844 OFFENSIVE_RATING 1792 DEFENSIVE_RATING 1792 NET_RATING 1709 PACE 1792 FREE_THROW_Rate 1844 THREE_POINTER_ATTEMPT_RATE 1402 TRUE_SHOOTING_PERCENTAGE 1844 EFFECTIEV_FIELD_GOAL_PERCENTAGE 1844 TURNOVER_PERCENTAGE 1527 OFFENSIVE_REBOUND_PERCENTAGE 1479 FREE_THROWS_PER_FIELD_GOAL_ATTEMPT 1844 OPPONENT_EFFECTIVE_FIELD_GAL 1581 OPPONENT_FREE_THROW_PERCENTAGE 1527 OPPONENT_DEFENSIVE_REBOUND_PERCENTAGE 1479 OPPONENT_FREE_THROW_PERCENTAGE.1 1581 ARENA 1757 ATTEND 1360 ATTEND_G 967 dtype: int64
# Do a left merge on the dataframes on the SEASON, LEAGUE, and TEAM columns
# This is done to ensure all dataframes have corresponding win and loss columns
team_totals_df = pd.merge(team_totals_df, team_sum_df[['SEASON', 'LEAGUE', 'TEAM', 'W', 'L']], on=['SEASON', 'LEAGUE', 'TEAM'], how='left')
# Check the non-null counts in the merged dataframe
print(team_totals_df.notnull().sum())
SEASON 1845 LEAGUE 1845 TEAM 1845 ABBREVIATION 1758 PLAYOFFS 1845 GAMES 1844 MINUTES_PLAYED 1655 FIELD_GOALS_MADE 1844 FIELD_GOALS_ATTEMPTED 1844 FIELD_GOALS_PERCENTAGE 1844 THREE_POINTERS_MADE 1402 THREE_POINTERS_ATTEMPTED 1402 THREE_POINTERS_PERCENTAGE 1402 TWO_POINTERS_MADE 1844 TWO_POINTERS_ATTEMPTED 1844 TWO_POINTERS_PERCENTAGE 1844 FREE_THROWS_MADE 1844 FREE_THROWS_ATTEMPTED 1844 FREE_THROW_PERCENTAGE 1844 OFFENSIVE_REBOUND 1515 DEFENSIVE_REBOUND 1515 TOTAL_REBOUNDS 1792 ASSISTS 1844 STEALS 1456 BLOCKS 1456 TURNOVERS 1527 PERSONAL FOULS 1844 POINTS 1844 W 1757 L 1757 dtype: int64
# Do a left merge on the dataframes on the SEASON, LEAGUE, and TEAM columns
# This is done to ensure all dataframes have corresponding win and loss columns
team_statPG_df = pd.merge(team_statPG_df, team_sum_df[['SEASON', 'LEAGUE', 'TEAM', 'W', 'L']], on=['SEASON', 'LEAGUE', 'TEAM'], how='left')
# Check the non-null counts in the merged dataframe
print(team_statPG_df.notnull().sum())
SEASON 1845 LEAGUE 1845 TEAM 1845 ABBREVIATION 1758 PLAYOFFS 1845 GAMES 1844 mp_per_game 1655 fg_per_game 1844 FIELD_GOAL_AGAINST_PER_GAME 1844 FIELD_GOAL_PERCENTAGE 1844 THREE_POINTERS_PER_GAME 1402 THREE_POINTERS_AGAINST_PER_GAME 1402 THREE_POINTER_POSITION 1402 TWO_POINTERS_PER_GAME 1844 TWO_POINTER_AGAINST_PER_GAME 1844 TWO_POINTER_PERCENTAGE 1844 FREE_THROW_PER_GAME 1844 FREE_THROW_AGAINST_PER_GAME 1844 FREE_THROW_PERCENTAGE 1844 OFFENSIVE_Rebound_per_game 1515 DEFENSIVE_REBOUND_PER_GAME 1515 TOTAL_REBOUND_PER_GAME 1792 ASSIST_PER_GAME 1844 STEALS_PER_GAME 1456 BLOCKS_PER_GAME 1456 TURNOVER_PER_GAME 1527 PERSONAL_FOULS_PER_GAME 1844 POINTS_PER_GAME 1844 W 1757 L 1757 dtype: int64
We'll now use a heatmap to determine which variables are most strongly correlated with the number of wins and the number of losses
# Calculate the correlation matrix
correlation_matrix = team_sum_df.corr(numeric_only=True)
# Plotting the heatmap to visualize the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()
# Sorting correlations with respect to 'W' (Wins) and 'L' (Losses)
correlation_with_wins = correlation_matrix['W'].sort_values(ascending=False)
correlation_with_losses = correlation_matrix['L'].sort_values(ascending=False)
print("Correlation with Wins:")
print(correlation_with_wins)
print("\nCorrelation with Losses:")
print(correlation_with_losses)
Correlation with Wins: W 1.000000 PW 0.968978 NET_RATING 0.923874 MARGIN_OF_VICTORY 0.917755 SIMPLE_RATING_SYSTEM 0.915788 PLAYOFFS 0.718380 AGE 0.439181 OFFENSIVE_RATING 0.358557 TRUE_SHOOTING_PERCENTAGE 0.356146 EFFECTIEV_FIELD_GOAL_PERCENTAGE 0.322930 ATTEND 0.289586 ATTEND_G 0.203818 FREE_THROWS_PER_FIELD_GOAL_ATTEMPT 0.142542 OPPONENT_DEFENSIVE_REBOUND_PERCENTAGE 0.093637 OFFENSIVE_REBOUND_PERCENTAGE 0.089972 FREE_THROW_Rate 0.082298 OPPONENT_FREE_THROW_PERCENTAGE 0.077277 SEASON 0.035656 PACE 0.005594 THREE_POINTER_ATTEMPT_RATE -0.008625 TURNOVER_PERCENTAGE -0.130810 OPPONENT_FREE_THROW_PERCENTAGE.1 -0.196387 DEFENSIVE_RATING -0.285715 OPPONENT_EFFECTIVE_FIELD_GAL -0.396491 STRENGTH_OF_SCHEDULE -0.475727 PL -0.773434 L -0.814865 Name: W, dtype: float64 Correlation with Losses: L 1.000000 PL 0.968569 STRENGTH_OF_SCHEDULE 0.489078 DEFENSIVE_RATING 0.331425 OPPONENT_EFFECTIVE_FIELD_GAL 0.312214 OPPONENT_FREE_THROW_PERCENTAGE.1 0.264494 TURNOVER_PERCENTAGE 0.214612 SEASON 0.035872 PACE 0.031618 OPPONENT_FREE_THROW_PERCENTAGE 0.006004 OFFENSIVE_REBOUND_PERCENTAGE 0.000353 ATTEND -0.092904 FREE_THROW_Rate -0.114637 EFFECTIEV_FIELD_GOAL_PERCENTAGE -0.123563 THREE_POINTER_ATTEMPT_RATE -0.131948 FREE_THROWS_PER_FIELD_GOAL_ATTEMPT -0.146011 TRUE_SHOOTING_PERCENTAGE -0.153538 ATTEND_G -0.167581 OPPONENT_DEFENSIVE_REBOUND_PERCENTAGE -0.193554 OFFENSIVE_RATING -0.310158 AGE -0.444833 PLAYOFFS -0.692873 PW -0.778654 W -0.814865 SIMPLE_RATING_SYSTEM -0.908782 MARGIN_OF_VICTORY -0.912530 NET_RATING -0.919095 Name: L, dtype: float64
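Rather than hand-picking the low-correlation columns dropped in the next cell, the same filter can be applied programmatically against the sorted correlations. A minimal sketch on a toy frame (the column values and the `0.15` cutoff are assumptions for illustration, not the project's actual threshold):

```python
import pandas as pd

# Toy frame standing in for team_sum_df (hypothetical values).
df = pd.DataFrame({
    'W': [10, 20, 30, 40, 50],
    'NET_RATING': [1.0, 2.1, 2.9, 4.2, 5.0],    # strongly correlated with W
    'PACE': [99.0, 100.0, 101.0, 100.0, 99.0],  # essentially unrelated to W
})

# Correlate every numeric column with wins, excluding W itself.
corr_w = df.corr(numeric_only=True)['W'].drop('W')

# Keep only the columns whose |r| clears the (assumed) cutoff.
weak = corr_w[corr_w.abs() < 0.15].index.tolist()
df_filtered = df.drop(columns=weak)
print(weak)                             # columns below the cutoff
print(df_filtered.columns.tolist())     # columns retained
```

This keeps the drop list in sync with the correlation table automatically if the data changes.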
team_sum_df.drop(columns=['FREE_THROWS_PER_FIELD_GOAL_ATTEMPT', 'OPPONENT_DEFENSIVE_REBOUND_PERCENTAGE',
'OFFENSIVE_REBOUND_PERCENTAGE', 'FREE_THROW_Rate',
'OPPONENT_FREE_THROW_PERCENTAGE', 'PACE',
'ARENA', 'ATTEND', 'ATTEND_G',
'THREE_POINTER_ATTEMPT_RATE'], inplace=True)
print(team_sum_df.notnull().sum())
SEASON 1845 LEAGUE 1845 TEAM 1845 ABBREVIATION 1758 PLAYOFFS 1845 AGE 1781 W 1757 L 1757 PW 1844 PL 1844 MARGIN_OF_VICTORY 1844 STRENGTH_OF_SCHEDULE 1844 SIMPLE_RATING_SYSTEM 1844 OFFENSIVE_RATING 1792 DEFENSIVE_RATING 1792 NET_RATING 1709 TRUE_SHOOTING_PERCENTAGE 1844 EFFECTIEV_FIELD_GOAL_PERCENTAGE 1844 TURNOVER_PERCENTAGE 1527 OPPONENT_EFFECTIVE_FIELD_GAL 1581 OPPONENT_FREE_THROW_PERCENTAGE.1 1581 dtype: int64
#Drop rows where the W and L columns have just a few missing values
team_sum_df.dropna(subset=['W', 'L'], inplace=True)
print(team_sum_df.notnull().sum())
SEASON 1757 LEAGUE 1757 TEAM 1757 ABBREVIATION 1757 PLAYOFFS 1757 AGE 1698 W 1757 L 1757 PW 1757 PL 1757 MARGIN_OF_VICTORY 1757 STRENGTH_OF_SCHEDULE 1757 SIMPLE_RATING_SYSTEM 1757 OFFENSIVE_RATING 1709 DEFENSIVE_RATING 1709 NET_RATING 1709 TRUE_SHOOTING_PERCENTAGE 1757 EFFECTIEV_FIELD_GOAL_PERCENTAGE 1757 TURNOVER_PERCENTAGE 1467 OPPONENT_EFFECTIVE_FIELD_GAL 1518 OPPONENT_FREE_THROW_PERCENTAGE.1 1518 dtype: int64
#Impute missing values using the mean of the rows sharing the same team name
# Define the columns to impute
columns_to_impute = ['AGE', 'DEFENSIVE_RATING', 'OFFENSIVE_RATING', 'NET_RATING',
'OPPONENT_EFFECTIVE_FIELD_GAL',
'OPPONENT_FREE_THROW_PERCENTAGE.1',
'TURNOVER_PERCENTAGE']
# Initialize the imputer once
imputer = SimpleImputer(strategy='mean')
# Iterate over each column to impute
for column in columns_to_impute:
# Group by 'TEAM' and apply the imputer to each group
team_sum_df[column] = team_sum_df.groupby('TEAM')[column].transform(
lambda x: imputer.fit_transform(x.values.reshape(-1, 1)).ravel() if x.notnull().any() else x
)
# Print the count of non-null values after imputation
print(team_sum_df.notnull().sum())
SEASON 1757 LEAGUE 1757 TEAM 1757 ABBREVIATION 1757 PLAYOFFS 1757 AGE 1736 W 1757 L 1757 PW 1757 PL 1757 MARGIN_OF_VICTORY 1757 STRENGTH_OF_SCHEDULE 1757 SIMPLE_RATING_SYSTEM 1757 OFFENSIVE_RATING 1738 DEFENSIVE_RATING 1738 NET_RATING 1738 TRUE_SHOOTING_PERCENTAGE 1757 EFFECTIEV_FIELD_GOAL_PERCENTAGE 1757 TURNOVER_PERCENTAGE 1608 OPPONENT_EFFECTIVE_FIELD_GAL 1653 OPPONENT_FREE_THROW_PERCENTAGE.1 1653 dtype: int64
#Drop rows where teams did not have any data to help with imputation
team_sum_df.dropna(subset=['AGE', 'TURNOVER_PERCENTAGE', 'OPPONENT_EFFECTIVE_FIELD_GAL',
'OPPONENT_FREE_THROW_PERCENTAGE.1'], inplace=True)
print(team_sum_df.notnull().sum())
SEASON 1608 LEAGUE 1608 TEAM 1608 ABBREVIATION 1608 PLAYOFFS 1608 AGE 1608 W 1608 L 1608 PW 1608 PL 1608 MARGIN_OF_VICTORY 1608 STRENGTH_OF_SCHEDULE 1608 SIMPLE_RATING_SYSTEM 1608 OFFENSIVE_RATING 1608 DEFENSIVE_RATING 1608 NET_RATING 1608 TRUE_SHOOTING_PERCENTAGE 1608 EFFECTIEV_FIELD_GOAL_PERCENTAGE 1608 TURNOVER_PERCENTAGE 1608 OPPONENT_EFFECTIVE_FIELD_GAL 1608 OPPONENT_FREE_THROW_PERCENTAGE.1 1608 dtype: int64
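The SimpleImputer loop used above effectively fills each gap with the mean of that team's other rows. The same group-wise imputation can be written more directly with `groupby(...).transform('mean')`; a minimal sketch on a toy frame (hypothetical team names and ages):

```python
import numpy as np
import pandas as pd

# Toy stand-in for team_sum_df (hypothetical values).
df = pd.DataFrame({
    'TEAM': ['Hawks', 'Hawks', 'Hawks', 'Celtics', 'Celtics'],
    'AGE':  [25.0, np.nan, 27.0, 30.0, np.nan],
})

# Fill each NaN with the mean AGE of the rows sharing the same TEAM;
# groupby.mean() skips NaN, so only observed values feed the average.
df['AGE'] = df['AGE'].fillna(df.groupby('TEAM')['AGE'].transform('mean'))
print(df['AGE'].tolist())  # [25.0, 26.0, 27.0, 30.0, 30.0]
```

Teams with no observed values at all stay NaN, matching the behavior of the loop (those rows are dropped afterward).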
print(team_totals_df.shape)
(1845, 30)
print(team_totals_df.notnull().sum())
SEASON 1845 LEAGUE 1845 TEAM 1845 ABBREVIATION 1758 PLAYOFFS 1845 GAMES 1844 MINUTES_PLAYED 1655 FIELD_GOALS_MADE 1844 FIELD_GOALS_ATTEMPTED 1844 FIELD_GOALS_PERCENTAGE 1844 THREE_POINTERS_MADE 1402 THREE_POINTERS_ATTEMPTED 1402 THREE_POINTERS_PERCENTAGE 1402 TWO_POINTERS_MADE 1844 TWO_POINTERS_ATTEMPTED 1844 TWO_POINTERS_PERCENTAGE 1844 FREE_THROWS_MADE 1844 FREE_THROWS_ATTEMPTED 1844 FREE_THROW_PERCENTAGE 1844 OFFENSIVE_REBOUND 1515 DEFENSIVE_REBOUND 1515 TOTAL_REBOUNDS 1792 ASSISTS 1844 STEALS 1456 BLOCKS 1456 TURNOVERS 1527 PERSONAL FOULS 1844 POINTS 1844 W 1757 L 1757 dtype: int64
# Doing a heatmap again to determine which variables are most strongly correlated with wins and losses
# Calculate the correlation matrix
correlation_matrix = team_totals_df.corr(numeric_only=True)
# Plotting the heatmap to visualize the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()
# Sorting correlations with respect to 'W' (Wins) and 'L' (Losses)
correlation_with_wins = correlation_matrix['W'].sort_values(ascending=False)
correlation_with_losses = correlation_matrix['L'].sort_values(ascending=False)
print("Correlation with Wins:")
print(correlation_with_wins)
print("\nCorrelation with Losses:")
print(correlation_with_losses)
Correlation with Wins: W 1.000000 PLAYOFFS 0.718380 DEFENSIVE_REBOUND 0.408673 ASSISTS 0.407251 POINTS 0.389685 FIELD_GOALS_PERCENTAGE 0.385399 FIELD_GOALS_MADE 0.353058 BLOCKS 0.318962 TWO_POINTERS_PERCENTAGE 0.316198 GAMES 0.313225 MINUTES_PLAYED 0.276975 STEALS 0.264540 FREE_THROWS_MADE 0.255571 TWO_POINTERS_MADE 0.245274 FIELD_GOALS_ATTEMPTED 0.222260 FREE_THROWS_ATTEMPTED 0.210115 TOTAL_REBOUNDS 0.203584 FREE_THROW_PERCENTAGE 0.200805 THREE_POINTERS_PERCENTAGE 0.133557 TWO_POINTERS_ATTEMPTED 0.107317 OFFENSIVE_REBOUND 0.092751 PERSONAL FOULS 0.089372 THREE_POINTERS_MADE 0.066237 TURNOVERS 0.052525 SEASON 0.035656 THREE_POINTERS_ATTEMPTED 0.033770 L -0.814865 Name: W, dtype: float64 Correlation with Losses: L 1.000000 PERSONAL FOULS 0.305033 GAMES 0.295246 TURNOVERS 0.280257 MINUTES_PLAYED 0.259580 FIELD_GOALS_ATTEMPTED 0.238983 OFFENSIVE_REBOUND 0.197767 TWO_POINTERS_ATTEMPTED 0.170492 TWO_POINTERS_MADE 0.129778 FIELD_GOALS_MADE 0.126110 POINTS 0.105312 FREE_THROWS_ATTEMPTED 0.068892 STEALS 0.056192 FREE_THROWS_MADE 0.046885 SEASON 0.035872 TOTAL_REBOUNDS 0.035398 ASSISTS 0.019720 DEFENSIVE_REBOUND -0.007304 BLOCKS -0.080132 THREE_POINTERS_ATTEMPTED -0.081842 FIELD_GOALS_PERCENTAGE -0.101884 FREE_THROW_PERCENTAGE -0.107084 TWO_POINTERS_PERCENTAGE -0.116543 THREE_POINTERS_MADE -0.117717 THREE_POINTERS_PERCENTAGE -0.200552 PLAYOFFS -0.692873 W -0.814865 Name: L, dtype: float64
#Drop Columns we might not use
team_totals_df.drop(columns=['THREE_POINTERS_PERCENTAGE', 'TWO_POINTERS_ATTEMPTED',
'OFFENSIVE_REBOUND', 'PERSONAL FOULS', 'THREE_POINTERS_MADE',
'THREE_POINTERS_ATTEMPTED',
], inplace=True)
print(team_totals_df.notnull().sum())
SEASON 1845 LEAGUE 1845 TEAM 1845 ABBREVIATION 1758 PLAYOFFS 1845 GAMES 1844 MINUTES_PLAYED 1655 FIELD_GOALS_MADE 1844 FIELD_GOALS_ATTEMPTED 1844 FIELD_GOALS_PERCENTAGE 1844 TWO_POINTERS_MADE 1844 TWO_POINTERS_PERCENTAGE 1844 FREE_THROWS_MADE 1844 FREE_THROWS_ATTEMPTED 1844 FREE_THROW_PERCENTAGE 1844 DEFENSIVE_REBOUND 1515 TOTAL_REBOUNDS 1792 ASSISTS 1844 STEALS 1456 BLOCKS 1456 TURNOVERS 1527 POINTS 1844 W 1757 L 1757 dtype: int64
#Drop rows where the W and L columns have just a few missing values
team_totals_df.dropna(subset=['W', 'L'], inplace=True)
print(team_totals_df.notnull().sum())
SEASON 1757 LEAGUE 1757 TEAM 1757 ABBREVIATION 1757 PLAYOFFS 1757 GAMES 1757 MINUTES_PLAYED 1586 FIELD_GOALS_MADE 1757 FIELD_GOALS_ATTEMPTED 1757 FIELD_GOALS_PERCENTAGE 1757 TWO_POINTERS_MADE 1757 TWO_POINTERS_PERCENTAGE 1757 FREE_THROWS_MADE 1757 FREE_THROWS_ATTEMPTED 1757 FREE_THROW_PERCENTAGE 1757 DEFENSIVE_REBOUND 1456 TOTAL_REBOUNDS 1709 ASSISTS 1757 STEALS 1402 BLOCKS 1402 TURNOVERS 1467 POINTS 1757 W 1757 L 1757 dtype: int64
#Impute missing values using the mean of the rows sharing the same team name
# Define the columns to impute
columns_to_impute = ['MINUTES_PLAYED', 'DEFENSIVE_REBOUND', 'TOTAL_REBOUNDS', 'STEALS', 'BLOCKS', 'TURNOVERS']
# Initialize the imputer once
imputer = SimpleImputer(strategy='mean')
# Iterate over each column to impute
for column in columns_to_impute:
# Group by 'TEAM' and apply the imputer to each group
team_totals_df[column] = team_totals_df.groupby('TEAM')[column].transform(
lambda x: imputer.fit_transform(x.values.reshape(-1, 1)).ravel() if x.notnull().any() else x
)
# Print the count of non-null values after imputation
print(team_totals_df.notnull().sum())
SEASON 1757 LEAGUE 1757 TEAM 1757 ABBREVIATION 1757 PLAYOFFS 1757 GAMES 1757 MINUTES_PLAYED 1666 FIELD_GOALS_MADE 1757 FIELD_GOALS_ATTEMPTED 1757 FIELD_GOALS_PERCENTAGE 1757 TWO_POINTERS_MADE 1757 TWO_POINTERS_PERCENTAGE 1757 FREE_THROWS_MADE 1757 FREE_THROWS_ATTEMPTED 1757 FREE_THROW_PERCENTAGE 1757 DEFENSIVE_REBOUND 1605 TOTAL_REBOUNDS 1738 ASSISTS 1757 STEALS 1573 BLOCKS 1573 TURNOVERS 1608 POINTS 1757 W 1757 L 1757 dtype: int64
#Drop rows where teams did not have any data to help with imputation
team_totals_df.dropna(subset=['MINUTES_PLAYED', 'DEFENSIVE_REBOUND', 'TOTAL_REBOUNDS'], inplace=True)
print(team_totals_df.notnull().sum())
SEASON 1605 LEAGUE 1605 TEAM 1605 ABBREVIATION 1605 PLAYOFFS 1605 GAMES 1605 MINUTES_PLAYED 1605 FIELD_GOALS_MADE 1605 FIELD_GOALS_ATTEMPTED 1605 FIELD_GOALS_PERCENTAGE 1605 TWO_POINTERS_MADE 1605 TWO_POINTERS_PERCENTAGE 1605 FREE_THROWS_MADE 1605 FREE_THROWS_ATTEMPTED 1605 FREE_THROW_PERCENTAGE 1605 DEFENSIVE_REBOUND 1605 TOTAL_REBOUNDS 1605 ASSISTS 1605 STEALS 1573 BLOCKS 1573 TURNOVERS 1605 POINTS 1605 W 1605 L 1605 dtype: int64
# Fill null values in 'STEALS' and 'BLOCKS' with zero (these stats were not
# recorded in the league's earliest seasons)
team_totals_df['STEALS'] = team_totals_df['STEALS'].fillna(0)
team_totals_df['BLOCKS'] = team_totals_df['BLOCKS'].fillna(0)
# Check the count of non-null values after filling null values
print(team_totals_df.notnull().sum())
SEASON 1605 LEAGUE 1605 TEAM 1605 ABBREVIATION 1605 PLAYOFFS 1605 GAMES 1605 MINUTES_PLAYED 1605 FIELD_GOALS_MADE 1605 FIELD_GOALS_ATTEMPTED 1605 FIELD_GOALS_PERCENTAGE 1605 TWO_POINTERS_MADE 1605 TWO_POINTERS_PERCENTAGE 1605 FREE_THROWS_MADE 1605 FREE_THROWS_ATTEMPTED 1605 FREE_THROW_PERCENTAGE 1605 DEFENSIVE_REBOUND 1605 TOTAL_REBOUNDS 1605 ASSISTS 1605 STEALS 1605 BLOCKS 1605 TURNOVERS 1605 POINTS 1605 W 1605 L 1605 dtype: int64
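After the zero-fill, a quick completeness check confirms no column slipped through the imputation steps. A minimal sketch on a toy frame (hypothetical values, with one gap left in deliberately):

```python
import numpy as np
import pandas as pd

# Toy stand-in for a cleaned dataframe (hypothetical values).
df = pd.DataFrame({'STEALS': [400.0, 0.0], 'BLOCKS': [300.0, np.nan]})

# List every column that still contains at least one NaN.
remaining = df.columns[df.isna().any()].tolist()
print(remaining)  # any non-empty result means more cleaning is needed
```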
#Dealing with the Team Stats Per Game dataframe
print(team_statPG_df.shape)
(1845, 30)
print(team_statPG_df.notnull().sum())
SEASON 1845 LEAGUE 1845 TEAM 1845 ABBREVIATION 1758 PLAYOFFS 1845 GAMES 1844 mp_per_game 1655 fg_per_game 1844 FIELD_GOAL_AGAINST_PER_GAME 1844 FIELD_GOAL_PERCENTAGE 1844 THREE_POINTERS_PER_GAME 1402 THREE_POINTERS_AGAINST_PER_GAME 1402 THREE_POINTER_POSITION 1402 TWO_POINTERS_PER_GAME 1844 TWO_POINTER_AGAINST_PER_GAME 1844 TWO_POINTER_PERCENTAGE 1844 FREE_THROW_PER_GAME 1844 FREE_THROW_AGAINST_PER_GAME 1844 FREE_THROW_PERCENTAGE 1844 OFFENSIVE_Rebound_per_game 1515 DEFENSIVE_REBOUND_PER_GAME 1515 TOTAL_REBOUND_PER_GAME 1792 ASSIST_PER_GAME 1844 STEALS_PER_GAME 1456 BLOCKS_PER_GAME 1456 TURNOVER_PER_GAME 1527 PERSONAL_FOULS_PER_GAME 1844 POINTS_PER_GAME 1844 W 1757 L 1757 dtype: int64
# Doing a heatmap to determine which variables are most strongly correlated with
# the number of wins versus the number of losses
# Calculate the correlation matrix
correlation_matrix = team_statPG_df.corr(numeric_only=True)
# Plotting the heatmap to visualize the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()
# Sorting correlations with respect to 'W' (Wins) and 'L' (Losses)
correlation_with_wins = correlation_matrix['W'].sort_values(ascending=False)
correlation_with_losses = correlation_matrix['L'].sort_values(ascending=False)
print("Correlation with Wins:")
print(correlation_with_wins)
print("\nCorrelation with Losses:")
print(correlation_with_losses)
Correlation with Wins: W 1.000000 PLAYOFFS 0.718380 FIELD_GOAL_PERCENTAGE 0.385399 ASSIST_PER_GAME 0.316808 TWO_POINTER_PERCENTAGE 0.316198 GAMES 0.313225 POINTS_PER_GAME 0.303466 fg_per_game 0.268725 DEFENSIVE_REBOUND_PER_GAME 0.265481 BLOCKS_PER_GAME 0.218514 FREE_THROW_PERCENTAGE 0.200805 TWO_POINTERS_PER_GAME 0.157563 STEALS_PER_GAME 0.147886 THREE_POINTER_POSITION 0.133557 FREE_THROW_PER_GAME 0.119712 FREE_THROW_AGAINST_PER_GAME 0.066455 TOTAL_REBOUND_PER_GAME 0.066041 SEASON 0.035656 THREE_POINTERS_PER_GAME 0.015291 mp_per_game 0.000792 FIELD_GOAL_AGAINST_PER_GAME -0.013688 THREE_POINTERS_AGAINST_PER_GAME -0.018949 TWO_POINTER_AGAINST_PER_GAME -0.019466 OFFENSIVE_Rebound_per_game -0.021232 TURNOVER_PER_GAME -0.096362 PERSONAL_FOULS_PER_GAME -0.132005 L -0.814865 Name: W, dtype: float64 Correlation with Losses: L 1.000000 GAMES 0.295246 TURNOVER_PER_GAME 0.194289 PERSONAL_FOULS_PER_GAME 0.157826 OFFENSIVE_Rebound_per_game 0.108481 TWO_POINTER_AGAINST_PER_GAME 0.057241 SEASON 0.035872 FIELD_GOAL_AGAINST_PER_GAME 0.031538 TWO_POINTERS_PER_GAME 0.023603 mp_per_game -0.012973 fg_per_game -0.045643 FREE_THROW_AGAINST_PER_GAME -0.079196 STEALS_PER_GAME -0.093570 FIELD_GOAL_PERCENTAGE -0.101884 TOTAL_REBOUND_PER_GAME -0.105894 FREE_THROW_PER_GAME -0.107001 FREE_THROW_PERCENTAGE -0.107084 POINTS_PER_GAME -0.115425 TWO_POINTER_PERCENTAGE -0.116543 THREE_POINTERS_AGAINST_PER_GAME -0.124966 ASSIST_PER_GAME -0.150416 THREE_POINTERS_PER_GAME -0.157817 THREE_POINTER_POSITION -0.200552 BLOCKS_PER_GAME -0.210151 DEFENSIVE_REBOUND_PER_GAME -0.308427 PLAYOFFS -0.692873 W -0.814865 Name: L, dtype: float64
team_statPG_df.dropna(subset=['W', 'L'], inplace=True)
print(team_statPG_df.notnull().sum())
SEASON 1757 LEAGUE 1757 TEAM 1757 ABBREVIATION 1757 PLAYOFFS 1757 GAMES 1757 mp_per_game 1586 fg_per_game 1757 FIELD_GOAL_AGAINST_PER_GAME 1757 FIELD_GOAL_PERCENTAGE 1757 THREE_POINTERS_PER_GAME 1348 THREE_POINTERS_AGAINST_PER_GAME 1348 THREE_POINTER_POSITION 1348 TWO_POINTERS_PER_GAME 1757 TWO_POINTER_AGAINST_PER_GAME 1757 TWO_POINTER_PERCENTAGE 1757 FREE_THROW_PER_GAME 1757 FREE_THROW_AGAINST_PER_GAME 1757 FREE_THROW_PERCENTAGE 1757 OFFENSIVE_Rebound_per_game 1456 DEFENSIVE_REBOUND_PER_GAME 1456 TOTAL_REBOUND_PER_GAME 1709 ASSIST_PER_GAME 1757 STEALS_PER_GAME 1402 BLOCKS_PER_GAME 1402 TURNOVER_PER_GAME 1467 PERSONAL_FOULS_PER_GAME 1757 POINTS_PER_GAME 1757 W 1757 L 1757 dtype: int64
#Drop columns with low correlation
team_statPG_df.drop(columns=['TWO_POINTERS_PER_GAME', 'STEALS_PER_GAME',
'THREE_POINTER_POSITION', 'FREE_THROW_PER_GAME', 'FREE_THROW_AGAINST_PER_GAME',
'TOTAL_REBOUND_PER_GAME', 'THREE_POINTERS_PER_GAME',
'mp_per_game', 'FIELD_GOAL_AGAINST_PER_GAME', 'THREE_POINTERS_AGAINST_PER_GAME',
'TWO_POINTER_AGAINST_PER_GAME', 'OFFENSIVE_Rebound_per_game', 'TURNOVER_PER_GAME',
'PERSONAL_FOULS_PER_GAME'], inplace=True)
print(team_statPG_df.notnull().sum())
SEASON 1757 LEAGUE 1757 TEAM 1757 ABBREVIATION 1757 PLAYOFFS 1757 GAMES 1757 fg_per_game 1757 FIELD_GOAL_PERCENTAGE 1757 TWO_POINTER_PERCENTAGE 1757 FREE_THROW_PERCENTAGE 1757 DEFENSIVE_REBOUND_PER_GAME 1456 ASSIST_PER_GAME 1757 BLOCKS_PER_GAME 1402 POINTS_PER_GAME 1757 W 1757 L 1757 dtype: int64
Here we impute using the mean not of the entire dataframe: for each missing value, we fill in the mean of the rows with the corresponding team name.
#Impute missing values using the mean of the rows sharing the same team name
# Define the columns to impute
columns_to_impute = ['DEFENSIVE_REBOUND_PER_GAME', 'BLOCKS_PER_GAME']
# Initialize the imputer once
imputer = SimpleImputer(strategy='mean')
# Iterate over each column to impute
for column in columns_to_impute:
# Group by 'TEAM' and apply the imputer to each group
team_statPG_df[column] = team_statPG_df.groupby('TEAM')[column].transform(
lambda x: imputer.fit_transform(x.values.reshape(-1, 1)).ravel() if x.notnull().any() else x
)
# Print the count of non-null values after imputation
print(team_statPG_df.notnull().sum())
SEASON 1757 LEAGUE 1757 TEAM 1757 ABBREVIATION 1757 PLAYOFFS 1757 GAMES 1757 fg_per_game 1757 FIELD_GOAL_PERCENTAGE 1757 TWO_POINTER_PERCENTAGE 1757 FREE_THROW_PERCENTAGE 1757 DEFENSIVE_REBOUND_PER_GAME 1605 ASSIST_PER_GAME 1757 BLOCKS_PER_GAME 1573 POINTS_PER_GAME 1757 W 1757 L 1757 dtype: int64
#Drop rows where teams did not have any data to help with imputation
team_statPG_df.dropna(subset=['DEFENSIVE_REBOUND_PER_GAME'], inplace=True)
print(team_statPG_df.notnull().sum())
SEASON 1605 LEAGUE 1605 TEAM 1605 ABBREVIATION 1605 PLAYOFFS 1605 GAMES 1605 fg_per_game 1605 FIELD_GOAL_PERCENTAGE 1605 TWO_POINTER_PERCENTAGE 1605 FREE_THROW_PERCENTAGE 1605 DEFENSIVE_REBOUND_PER_GAME 1605 ASSIST_PER_GAME 1605 BLOCKS_PER_GAME 1573 POINTS_PER_GAME 1605 W 1605 L 1605 dtype: int64
#Drop the BLOCKS_PER_GAME column as BLOCKS already exists in team_totals_df
team_statPG_df.drop(columns=['BLOCKS_PER_GAME'], inplace=True)
print(team_statPG_df.notnull().sum())
SEASON 1605 LEAGUE 1605 TEAM 1605 ABBREVIATION 1605 PLAYOFFS 1605 GAMES 1605 fg_per_game 1605 FIELD_GOAL_PERCENTAGE 1605 TWO_POINTER_PERCENTAGE 1605 FREE_THROW_PERCENTAGE 1605 DEFENSIVE_REBOUND_PER_GAME 1605 ASSIST_PER_GAME 1605 POINTS_PER_GAME 1605 W 1605 L 1605 dtype: int64
Merging the teams CSVs based on the season and the team name (team_id is unfortunately not in all CSVs)
# Initial merge (full outer join) between team_totals_df and team_sum_df
merged_team_df = pd.merge(team_totals_df, team_sum_df, on=['TEAM', 'SEASON'], how='outer')
merged_team_df = pd.merge(merged_team_df, team_statPG_df, on=['TEAM', 'SEASON'], how='outer')
merged_team_df.head(10)
| | SEASON | LEAGUE_x | TEAM | ABBREVIATION_x | PLAYOFFS_x | GAMES_x | MINUTES_PLAYED | FIELD_GOALS_MADE | FIELD_GOALS_ATTEMPTED | FIELD_GOALS_PERCENTAGE | ... | GAMES_y | fg_per_game | FIELD_GOAL_PERCENTAGE | TWO_POINTER_PERCENTAGE | FREE_THROW_PERCENTAGE_y | DEFENSIVE_REBOUND_PER_GAME | ASSIST_PER_GAME | POINTS_PER_GAME | W | L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024 | NBA | Atlanta Hawks | ATL | False | 55.0 | 13300.0 | 2397.0 | 5144.0 | 0.466 | ... | 55.0 | 43.6 | 0.466 | 0.539 | 0.813 | 31.9 | 26.2 | 121.3 | 24.0 | 31.0 |
| 1 | 2024 | NBA | Boston Celtics | BOS | False | 55.0 | 13325.0 | 2391.0 | 4967.0 | 0.481 | ... | 55.0 | 43.5 | 0.481 | 0.572 | 0.808 | 36.6 | 26.2 | 120.7 | 43.0 | 12.0 |
| 2 | 2024 | NBA | Brooklyn Nets | BRK | False | 54.0 | 13035.0 | 2257.0 | 4902.0 | 0.460 | ... | 54.0 | 41.8 | 0.460 | 0.525 | 0.759 | 32.8 | 26.8 | 113.4 | 21.0 | 33.0 |
| 3 | 2024 | NBA | Chicago Bulls | CHI | False | 55.0 | 13400.0 | 2292.0 | 4904.0 | 0.467 | ... | 55.0 | 41.7 | 0.467 | 0.529 | 0.791 | 32.4 | 24.3 | 111.8 | 26.0 | 29.0 |
| 4 | 2024 | NBA | Charlotte Hornets | CHO | False | 54.0 | 13010.0 | 2206.0 | 4790.0 | 0.461 | ... | 54.0 | 40.9 | 0.461 | 0.519 | 0.788 | 31.0 | 24.9 | 108.6 | 13.0 | 41.0 |
| 5 | 2024 | NBA | Cleveland Cavaliers | CLE | False | 53.0 | 12770.0 | 2267.0 | 4698.0 | 0.483 | ... | 53.0 | 42.8 | 0.483 | 0.570 | 0.772 | 34.2 | 27.3 | 114.9 | 36.0 | 17.0 |
| 6 | 2024 | NBA | Dallas Mavericks | DAL | False | 55.0 | 13200.0 | 2362.0 | 4937.0 | 0.478 | ... | 55.0 | 42.9 | 0.478 | 0.565 | 0.755 | 32.3 | 25.6 | 118.7 | 32.0 | 23.0 |
| 7 | 2024 | NBA | Denver Nuggets | DEN | False | 55.0 | 13200.0 | 2388.0 | 4883.0 | 0.489 | ... | 55.0 | 43.4 | 0.489 | 0.554 | 0.756 | 33.0 | 28.6 | 113.9 | 36.0 | 19.0 |
| 8 | 2024 | NBA | Detroit Pistons | DET | False | 54.0 | 13035.0 | 2281.0 | 4836.0 | 0.472 | ... | 54.0 | 42.2 | 0.472 | 0.531 | 0.778 | 32.4 | 26.5 | 112.9 | 8.0 | 46.0 |
| 9 | 2024 | NBA | Golden State Warriors | GSW | False | 53.0 | 12870.0 | 2311.0 | 4871.0 | 0.474 | ... | 53.0 | 43.6 | 0.474 | 0.546 | 0.787 | 34.3 | 29.0 | 119.7 | 27.0 | 26.0 |
10 rows × 56 columns
#Just sanity checking
print(team_totals_df.shape)
print(team_sum_df.shape)
print(team_statPG_df.shape)
print(merged_team_df.shape)
(1605, 24) (1608, 21) (1605, 15) (1608, 56)
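The 1608-vs-1605 row counts above suggest a few team-seasons exist only in team_sum_df. pandas' `indicator` flag makes such mismatches explicit by recording which side each merged row came from; a minimal sketch on toy frames (hypothetical teams and values):

```python
import pandas as pd

# Toy stand-ins for two of the team dataframes (hypothetical values).
left = pd.DataFrame({'TEAM': ['Hawks', 'Celtics'], 'SEASON': [2024, 2024],
                     'POINTS': [121, 120]})
right = pd.DataFrame({'TEAM': ['Hawks', 'Celtics', 'Nets'], 'SEASON': [2024, 2024, 2024],
                      'NET_RATING': [1.5, 7.2, -3.0]})

# indicator=True adds a '_merge' column flagging each row as
# 'both', 'left_only', or 'right_only'.
merged = pd.merge(left, right, on=['TEAM', 'SEASON'], how='outer', indicator=True)
right_only = merged[merged['_merge'] == 'right_only']
print(right_only[['TEAM', 'SEASON']])  # the unmatched team-seasons
```

Running the same pattern on the real frames would name the three extra rows before they are dropped.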
merged_team_columns = merged_team_df.columns.tolist()
print(merged_team_columns)
['SEASON', 'LEAGUE_x', 'TEAM', 'ABBREVIATION_x', 'PLAYOFFS_x', 'GAMES_x', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE', 'FIELD_GOALS_ATTEMPTED', 'FIELD_GOALS_PERCENTAGE', 'TWO_POINTERS_MADE', 'TWO_POINTERS_PERCENTAGE', 'FREE_THROWS_MADE', 'FREE_THROWS_ATTEMPTED', 'FREE_THROW_PERCENTAGE_x', 'DEFENSIVE_REBOUND', 'TOTAL_REBOUNDS', 'ASSISTS', 'STEALS', 'BLOCKS', 'TURNOVERS', 'POINTS', 'W_x', 'L_x', 'LEAGUE_y', 'ABBREVIATION_y', 'PLAYOFFS_y', 'AGE', 'W_y', 'L_y', 'PW', 'PL', 'MARGIN_OF_VICTORY', 'STRENGTH_OF_SCHEDULE', 'SIMPLE_RATING_SYSTEM', 'OFFENSIVE_RATING', 'DEFENSIVE_RATING', 'NET_RATING', 'TRUE_SHOOTING_PERCENTAGE', 'EFFECTIEV_FIELD_GOAL_PERCENTAGE', 'TURNOVER_PERCENTAGE', 'OPPONENT_EFFECTIVE_FIELD_GAL', 'OPPONENT_FREE_THROW_PERCENTAGE.1', 'LEAGUE', 'ABBREVIATION', 'PLAYOFFS', 'GAMES_y', 'fg_per_game', 'FIELD_GOAL_PERCENTAGE', 'TWO_POINTER_PERCENTAGE', 'FREE_THROW_PERCENTAGE_y', 'DEFENSIVE_REBOUND_PER_GAME', 'ASSIST_PER_GAME', 'POINTS_PER_GAME', 'W', 'L']
print(merged_team_df.notnull().sum())
SEASON 1608 LEAGUE_x 1605 TEAM 1608 ABBREVIATION_x 1605 PLAYOFFS_x 1605 GAMES_x 1605 MINUTES_PLAYED 1605 FIELD_GOALS_MADE 1605 FIELD_GOALS_ATTEMPTED 1605 FIELD_GOALS_PERCENTAGE 1605 TWO_POINTERS_MADE 1605 TWO_POINTERS_PERCENTAGE 1605 FREE_THROWS_MADE 1605 FREE_THROWS_ATTEMPTED 1605 FREE_THROW_PERCENTAGE_x 1605 DEFENSIVE_REBOUND 1605 TOTAL_REBOUNDS 1605 ASSISTS 1605 STEALS 1605 BLOCKS 1605 TURNOVERS 1605 POINTS 1605 W_x 1605 L_x 1605 LEAGUE_y 1608 ABBREVIATION_y 1608 PLAYOFFS_y 1608 AGE 1608 W_y 1608 L_y 1608 PW 1608 PL 1608 MARGIN_OF_VICTORY 1608 STRENGTH_OF_SCHEDULE 1608 SIMPLE_RATING_SYSTEM 1608 OFFENSIVE_RATING 1608 DEFENSIVE_RATING 1608 NET_RATING 1608 TRUE_SHOOTING_PERCENTAGE 1608 EFFECTIEV_FIELD_GOAL_PERCENTAGE 1608 TURNOVER_PERCENTAGE 1608 OPPONENT_EFFECTIVE_FIELD_GAL 1608 OPPONENT_FREE_THROW_PERCENTAGE.1 1608 LEAGUE 1605 ABBREVIATION 1605 PLAYOFFS 1605 GAMES_y 1605 fg_per_game 1605 FIELD_GOAL_PERCENTAGE 1605 TWO_POINTER_PERCENTAGE 1605 FREE_THROW_PERCENTAGE_y 1605 DEFENSIVE_REBOUND_PER_GAME 1605 ASSIST_PER_GAME 1605 POINTS_PER_GAME 1605 W 1605 L 1605 dtype: int64
merged_team_df.dropna(subset=['W_x', 'L_x'], inplace=True)
print(merged_team_df.notnull().sum())
SEASON 1605 LEAGUE_x 1605 TEAM 1605 ABBREVIATION_x 1605 PLAYOFFS_x 1605 GAMES_x 1605 MINUTES_PLAYED 1605 FIELD_GOALS_MADE 1605 FIELD_GOALS_ATTEMPTED 1605 FIELD_GOALS_PERCENTAGE 1605 TWO_POINTERS_MADE 1605 TWO_POINTERS_PERCENTAGE 1605 FREE_THROWS_MADE 1605 FREE_THROWS_ATTEMPTED 1605 FREE_THROW_PERCENTAGE_x 1605 DEFENSIVE_REBOUND 1605 TOTAL_REBOUNDS 1605 ASSISTS 1605 STEALS 1605 BLOCKS 1605 TURNOVERS 1605 POINTS 1605 W_x 1605 L_x 1605 LEAGUE_y 1605 ABBREVIATION_y 1605 PLAYOFFS_y 1605 AGE 1605 W_y 1605 L_y 1605 PW 1605 PL 1605 MARGIN_OF_VICTORY 1605 STRENGTH_OF_SCHEDULE 1605 SIMPLE_RATING_SYSTEM 1605 OFFENSIVE_RATING 1605 DEFENSIVE_RATING 1605 NET_RATING 1605 TRUE_SHOOTING_PERCENTAGE 1605 EFFECTIEV_FIELD_GOAL_PERCENTAGE 1605 TURNOVER_PERCENTAGE 1605 OPPONENT_EFFECTIVE_FIELD_GAL 1605 OPPONENT_FREE_THROW_PERCENTAGE.1 1605 LEAGUE 1605 ABBREVIATION 1605 PLAYOFFS 1605 GAMES_y 1605 fg_per_game 1605 FIELD_GOAL_PERCENTAGE 1605 TWO_POINTER_PERCENTAGE 1605 FREE_THROW_PERCENTAGE_y 1605 DEFENSIVE_REBOUND_PER_GAME 1605 ASSIST_PER_GAME 1605 POINTS_PER_GAME 1605 W 1605 L 1605 dtype: int64
#Drop repeating columns (each listed once)
merged_team_df.drop(columns=['W_x',
'W_y', 'L_y', 'FREE_THROW_PERCENTAGE_y', 'GAMES_y', 'LEAGUE_y',
'ABBREVIATION_y'], inplace=True)
print(merged_team_df.notnull().sum())
SEASON 1605 LEAGUE_x 1605 TEAM 1605 ABBREVIATION_x 1605 PLAYOFFS_x 1605 GAMES_x 1605 MINUTES_PLAYED 1605 FIELD_GOALS_MADE 1605 FIELD_GOALS_ATTEMPTED 1605 FIELD_GOALS_PERCENTAGE 1605 TWO_POINTERS_MADE 1605 TWO_POINTERS_PERCENTAGE 1605 FREE_THROWS_MADE 1605 FREE_THROWS_ATTEMPTED 1605 FREE_THROW_PERCENTAGE_x 1605 DEFENSIVE_REBOUND 1605 TOTAL_REBOUNDS 1605 ASSISTS 1605 STEALS 1605 BLOCKS 1605 TURNOVERS 1605 POINTS 1605 L_x 1605 PLAYOFFS_y 1605 AGE 1605 PW 1605 PL 1605 MARGIN_OF_VICTORY 1605 STRENGTH_OF_SCHEDULE 1605 SIMPLE_RATING_SYSTEM 1605 OFFENSIVE_RATING 1605 DEFENSIVE_RATING 1605 NET_RATING 1605 TRUE_SHOOTING_PERCENTAGE 1605 EFFECTIEV_FIELD_GOAL_PERCENTAGE 1605 TURNOVER_PERCENTAGE 1605 OPPONENT_EFFECTIVE_FIELD_GAL 1605 OPPONENT_FREE_THROW_PERCENTAGE.1 1605 LEAGUE 1605 ABBREVIATION 1605 PLAYOFFS 1605 fg_per_game 1605 FIELD_GOAL_PERCENTAGE 1605 TWO_POINTER_PERCENTAGE 1605 DEFENSIVE_REBOUND_PER_GAME 1605 ASSIST_PER_GAME 1605 POINTS_PER_GAME 1605 W 1605 L 1605 dtype: int64
# Doing a heatmap to determine which variables are most strongly correlated with
# the number of wins versus the number of losses
# Calculate the correlation matrix
correlation_matrix = merged_team_df.corr(numeric_only=True)
# Plotting the heatmap to visualize the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()
# Sorting correlations with respect to 'W' (Wins) and 'L' (Losses)
correlation_with_wins = correlation_matrix['W'].sort_values(ascending=False)
correlation_with_losses = correlation_matrix['L'].sort_values(ascending=False)
print("Correlation with Wins:")
print(correlation_with_wins)
print("\nCorrelation with Losses:")
print(correlation_with_losses)
#We might still want to drop TURNOVER_PERCENTAGE and FREE_THROW_PERCENTAGE_x
Correlation with Wins: W 1.000000 PW 0.970126 MARGIN_OF_VICTORY 0.925597 SIMPLE_RATING_SYSTEM 0.923119 NET_RATING 0.919308 PLAYOFFS_y 0.739580 AGE 0.439641 FIELD_GOAL_PERCENTAGE 0.404188 FIELD_GOALS_PERCENTAGE 0.404188 ASSISTS 0.401628 POINTS 0.388451 DEFENSIVE_REBOUND 0.380249 OFFENSIVE_RATING 0.371771 TRUE_SHOOTING_PERCENTAGE 0.353290 FIELD_GOALS_MADE 0.343076 EFFECTIEV_FIELD_GOAL_PERCENTAGE 0.311990 ASSIST_PER_GAME 0.300734 TWO_POINTER_PERCENTAGE 0.299908 TWO_POINTERS_PERCENTAGE 0.299908 GAMES_x 0.289034 POINTS_PER_GAME 0.284400 MINUTES_PLAYED 0.273111 FREE_THROWS_MADE 0.271898 BLOCKS 0.264094 TOTAL_REBOUNDS 0.259601 DEFENSIVE_REBOUND_PER_GAME 0.246202 fg_per_game 0.241840 FREE_THROWS_ATTEMPTED 0.235512 TWO_POINTERS_MADE 0.224873 FIELD_GOALS_ATTEMPTED 0.209035 STEALS 0.207923 FREE_THROW_PERCENTAGE_x 0.159023 TURNOVERS 0.037907 SEASON -0.041475 TURNOVER_PERCENTAGE -0.132646 OPPONENT_FREE_THROW_PERCENTAGE.1 -0.193867 OPPONENT_EFFECTIVE_FIELD_GAL -0.387089 DEFENSIVE_RATING -0.392106 STRENGTH_OF_SCHEDULE -0.461399 PL -0.800330 L_x -0.839122 L -0.839122 Name: W, dtype: float64 Correlation with Losses: L 1.000000 L_x 1.000000 PL 0.969897 STRENGTH_OF_SCHEDULE 0.475642 DEFENSIVE_RATING 0.375201 OPPONENT_EFFECTIVE_FIELD_GAL 0.316313 GAMES_x 0.278192 MINUTES_PLAYED 0.253258 OPPONENT_FREE_THROW_PERCENTAGE.1 0.253139 TURNOVERS 0.252481 FIELD_GOALS_ATTEMPTED 0.224683 TURNOVER_PERCENTAGE 0.194887 TWO_POINTERS_MADE 0.103275 FIELD_GOALS_MADE 0.082420 FREE_THROWS_ATTEMPTED 0.063406 POINTS 0.055874 STEALS 0.039067 TOTAL_REBOUNDS 0.037651 FREE_THROWS_MADE 0.036751 SEASON -0.012133 DEFENSIVE_REBOUND -0.021783 ASSISTS -0.036554 BLOCKS -0.061461 fg_per_game -0.122568 FREE_THROW_PERCENTAGE_x -0.141567 POINTS_PER_GAME -0.208133 ASSIST_PER_GAME -0.227411 FIELD_GOAL_PERCENTAGE -0.247226 FIELD_GOALS_PERCENTAGE -0.247226 TWO_POINTER_PERCENTAGE -0.251151 TWO_POINTERS_PERCENTAGE -0.251151 EFFECTIEV_FIELD_GOAL_PERCENTAGE -0.264743 TRUE_SHOOTING_PERCENTAGE -0.299309 DEFENSIVE_REBOUND_PER_GAME -0.303271 OFFENSIVE_RATING -0.384010 AGE -0.448899 PLAYOFFS_y -0.692145 PW -0.803704 W -0.839122 NET_RATING -0.913882 SIMPLE_RATING_SYSTEM -0.915752 MARGIN_OF_VICTORY -0.919776 Name: L, dtype: float64
merged_team_df.columns.tolist()
['SEASON', 'LEAGUE_x', 'TEAM', 'ABBREVIATION_x', 'PLAYOFFS_x', 'GAMES_x', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE', 'FIELD_GOALS_ATTEMPTED', 'FIELD_GOALS_PERCENTAGE', 'TWO_POINTERS_MADE', 'TWO_POINTERS_PERCENTAGE', 'FREE_THROWS_MADE', 'FREE_THROWS_ATTEMPTED', 'FREE_THROW_PERCENTAGE_x', 'DEFENSIVE_REBOUND', 'TOTAL_REBOUNDS', 'ASSISTS', 'STEALS', 'BLOCKS', 'TURNOVERS', 'POINTS', 'L_x', 'PLAYOFFS_y', 'AGE', 'PW', 'PL', 'MARGIN_OF_VICTORY', 'STRENGTH_OF_SCHEDULE', 'SIMPLE_RATING_SYSTEM', 'OFFENSIVE_RATING', 'DEFENSIVE_RATING', 'NET_RATING', 'TRUE_SHOOTING_PERCENTAGE', 'EFFECTIEV_FIELD_GOAL_PERCENTAGE', 'TURNOVER_PERCENTAGE', 'OPPONENT_EFFECTIVE_FIELD_GAL', 'OPPONENT_FREE_THROW_PERCENTAGE.1', 'LEAGUE', 'ABBREVIATION', 'PLAYOFFS', 'fg_per_game', 'FIELD_GOAL_PERCENTAGE', 'TWO_POINTER_PERCENTAGE', 'DEFENSIVE_REBOUND_PER_GAME', 'ASSIST_PER_GAME', 'POINTS_PER_GAME', 'W', 'L']
#Working with the games CSVs
game_df = pd.read_csv("drive/MyDrive/FINAL PROJECT CSVs/game.csv")
games_df = pd.read_csv("drive/MyDrive/FINAL PROJECT CSVs/games.csv")
# Upon examining the heatmap, the correlation between this dataset's columns
# and team wins is very low. We will thus not be using it.
games_details_df = pd.read_csv("drive/MyDrive/FINAL PROJECT CSVs/games_details.csv")
<ipython-input-110-31b392da8928>:7: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
games_details_df = pd.read_csv("drive/MyDrive/FINAL PROJECT CSVs/games_details.csv")
print(games_details_df.dtypes)
GAME_ID int64 TEAM_ID int64 TEAM_ABBREVIATION object TEAM_CITY object PLAYER_ID int64 PLAYER_NAME object NICKNAME object START_POSITION object COMMENT object MINUTES object FIELD_GOALS_MADE float64 FIELD_GOALS_ATTEMPTED float64 FIELD_GOAL_PERCENTAGE float64 THREE_POINTERS_MADE float64 THREE_POINTERS_ATTEMPTED float64 THREE_POINTERS_PERCENTAGE float64 FREE_THROWS_MADE float64 FREE_THROWS_ATTEMPTED float64 FREE_THROWS_PERCENTAGE float64 OFFENSIVE_REBOUND float64 DEFENSIVE_REBOUND float64 TOTAL_REBOUNDS float64 ASSITS float64 STEALS float64 BLOCKS float64 TURNOVERS float64 PERSONAL_FOULS float64 TOTAL_POINTS float64 PLUS_MINUS float64 dtype: object
#There are irregularities in the minutes format.
def convert_to_mmss(time_str):
# Check if the value is NaN
if pd.isna(time_str):
return None # Return None for missing values
# Split time string into hours, minutes, and seconds
parts = time_str.split(':')
# Check if the parts list has only one element (minutes)
if len(parts) == 1:
minutes = parts[0]
seconds = '00' # Set seconds to '00'
elif len(parts) >= 2:
minutes = parts[-2]
seconds = parts[-1]
else:
return None # Return None if the format is not as expected
# Concatenate minutes and seconds with a colon separator
mm_ss_format = f"{minutes}:{seconds}"
return mm_ss_format
# Apply the function to the 'MINUTES' column
games_details_df['MINUTES'] = games_details_df['MINUTES'].apply(convert_to_mmss)
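As a quick sanity check of the cleaning step above, here is a standalone sketch that re-declares the same parsing rule and applies it to a few representative raw values (the sample strings are hypothetical, not taken from the dataset):

```python
import pandas as pd

def convert_to_mmss(time_str):
    # Normalize raw minutes strings to "MM:SS" (same rule as the function above)
    if pd.isna(time_str):
        return None
    parts = time_str.split(':')
    if len(parts) == 1:
        return f"{parts[0]}:00"          # bare minutes, e.g. "41" -> "41:00"
    return f"{parts[-2]}:{parts[-1]}"    # keep the last two fields, e.g. "1:41:30" -> "41:30"

sample = pd.Series(["36:20", "41", "1:41:30", None])
print(sample.apply(convert_to_mmss).tolist())
```

Note that `str.split` always returns at least one element for a string, so these two branches cover every non-null case.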
print(games_details_df.dtypes)
GAME_ID int64 TEAM_ID int64 TEAM_ABBREVIATION object TEAM_CITY object PLAYER_ID int64 PLAYER_NAME object NICKNAME object START_POSITION object COMMENT object MINUTES object FIELD_GOALS_MADE float64 FIELD_GOALS_ATTEMPTED float64 FIELD_GOAL_PERCENTAGE float64 THREE_POINTERS_MADE float64 THREE_POINTERS_ATTEMPTED float64 THREE_POINTERS_PERCENTAGE float64 FREE_THROWS_MADE float64 FREE_THROWS_ATTEMPTED float64 FREE_THROWS_PERCENTAGE float64 OFFENSIVE_REBOUND float64 DEFENSIVE_REBOUND float64 TOTAL_REBOUNDS float64 ASSITS float64 STEALS float64 BLOCKS float64 TURNOVERS float64 PERSONAL_FOULS float64 TOTAL_POINTS float64 PLUS_MINUS float64 dtype: object
print(games_df.dtypes)
GAME_DATE_EST object GAME_ID int64 GAME_STATUS_TEXT object HOME_TEAM_ID int64 VISITOR_TEAM_ID int64 SEASON int64 TEAM_ID_HOME int64 POINTS_HOME float64 FIELD_GOALS_PERCENTAGE_HOME float64 FREE_THROWS_PERCENTAGE_HOME float64 THREE_POINTERS_PERCENTAGE_HOME float64 ASSISTS_HOME float64 TOTAL_REBOUNDS_HOME float64 TEAM_ID_AWAY int64 POINTS_AWAY float64 FIELD_GOALS_PERCENTAGE_AWAY float64 FREE_THROWS_PERCENTAGE_AWAY float64 THREE_POINTERS_PERCENTAGE_AWAY float64 ASSISTS_AWAY float64 TOTAL_REBOUNDS_AWAY float64 HOME_TEAM_WINS int64 dtype: object
print(game_df.dtypes)
SEASON_ID int64 TEAM_ID_HOME int64 TEAM_ABBREVIATION_HOME object TEAM_NAME_HOME object GAME_ID int64 GAME_DATE object MATCHUP_HOME object WIN_LOSS_HOME object MINUTES int64 FIELD_GOALS_MADE_HOME float64 FIELD_GOALS_ATTEMPTED_HOME float64 FIELD_GOALS_PERCENTAGE_HOME float64 THREE_POINTERS_HOME float64 THREE_POINTERS_ATTEMPTED_HOME float64 THREE_POINTERS_PERCENTAGE_HOME float64 FREE_THROWS_MADE_HOME float64 FREE_THROWS_ATTEMPTED_HOME float64 FREE_THROWS_PERCENTAGE_HOME float64 OFFENSIVE_REBOUND_HOME float64 DEFENSIVE_REBOUND_HOME float64 TOTAL_REBOUNDS_HOME float64 ASSISTS_HOME float64 STEALS_HOME float64 BLOCKS_HOME float64 TURNOVERS_HOME float64 PERSONAL_FOULS_HOME float64 TOTAL_POINTS_HOME int64 PLUS_MINUS_HOME int64 TEAM_ID_AWAY int64 TEAM_ABBREVIATION_AWAY object TEAM_NAME_AWAY object MATCHUP_AWAY object WIN_LOSS_AWAY object MINUTES.1 float64 FIELD_GOALS_MADE_AWAY float64 FIELD_GOALS_ATTEMPTED_AWAY float64 FIELD_GOALS_PERCENTAGE_AWAY float64 THREE_POINTERS_WAY float64 THREE_POINTERS_ATTEMPTED_AWAY float64 THREE_POINTERS_PERCENTAGE_AWAY float64 FREE_THROWS_MADE_AWAY float64 FREE_THROWS_ATTEMPTED_AWAY float64 FREE_THROWS_PERCENTAGE_AWAY float64 OFFENSIVE_REBOUND_AWAY float64 DEFENSIVE_REBOUND_AWAY float64 TOTAL_REBOUNDS_AWAY float64 ASSISTS_AWAY float64 STEALS_AWAY float64 BLOCKS_AWAY float64 TURNOVERS_AWAY float64 PERSONAL_FOULS_AWAY int64 TOTAL_POINTS_AWAY int64 PLUS_MINUS_AWAY int64 SEASON_TYPE object dtype: object
# Do a left merge on the dataframes on GAME_ID
# This is done to ensure every row has a corresponding win/loss column
game_df = pd.merge(game_df, games_df[['GAME_ID', 'HOME_TEAM_WINS']], on=['GAME_ID'], how='left')
# Display the merged dataframe
print(game_df.notnull().sum())
# Note: after the merge, only about 25,000 of game_df's ~65,700 rows have a HOME_TEAM_WINS value
SEASON_ID 65727 TEAM_ID_HOME 65727 TEAM_ABBREVIATION_HOME 65727 TEAM_NAME_HOME 65727 GAME_ID 65727 GAME_DATE 65727 MATCHUP_HOME 65727 WIN_LOSS_HOME 65725 MINUTES 65727 FIELD_GOALS_MADE_HOME 65714 FIELD_GOALS_ATTEMPTED_HOME 50280 FIELD_GOALS_PERCENTAGE_HOME 50237 THREE_POINTERS_HOME 52509 THREE_POINTERS_ATTEMPTED_HOME 47044 THREE_POINTERS_PERCENTAGE_HOME 46653 FREE_THROWS_MADE_HOME 65711 FREE_THROWS_ATTEMPTED_HOME 62723 FREE_THROWS_PERCENTAGE_HOME 62718 OFFENSIVE_REBOUND_HOME 46791 DEFENSIVE_REBOUND_HOME 46728 TOTAL_REBOUNDS_HOME 49998 ASSISTS_HOME 49922 STEALS_HOME 46878 BLOCKS_HOME 47101 TURNOVERS_HOME 47043 PERSONAL_FOULS_HOME 62871 TOTAL_POINTS_HOME 65727 PLUS_MINUS_HOME 65727 TEAM_ID_AWAY 65727 TEAM_ABBREVIATION_AWAY 65727 TEAM_NAME_AWAY 65727 MATCHUP_AWAY 65727 WIN_LOSS_AWAY 65725 MINUTES.1 65714 FIELD_GOALS_MADE_AWAY 50280 FIELD_GOALS_ATTEMPTED_AWAY 50238 FIELD_GOALS_PERCENTAGE_AWAY 52509 THREE_POINTERS_WAY 47044 THREE_POINTERS_ATTEMPTED_AWAY 46765 THREE_POINTERS_PERCENTAGE_AWAY 65714 FREE_THROWS_MADE_AWAY 62723 FREE_THROWS_ATTEMPTED_AWAY 62721 FREE_THROWS_PERCENTAGE_AWAY 46791 OFFENSIVE_REBOUND_AWAY 46729 DEFENSIVE_REBOUND_AWAY 50002 TOTAL_REBOUNDS_AWAY 49926 ASSISTS_AWAY 46878 STEALS_AWAY 47102 BLOCKS_AWAY 47042 TURNOVERS_AWAY 62876 PERSONAL_FOULS_AWAY 65727 TOTAL_POINTS_AWAY 65727 PLUS_MINUS_AWAY 65727 SEASON_TYPE 65727 HOME_TEAM_WINS 24986 dtype: int64
# Because of the sparse HOME_TEAM_WINS coverage, this merged result will not be used going forward
# Do a left merge on the dataframes on GAME_ID
# This is done to ensure every row has a corresponding win/loss column
games_details_df = pd.merge(games_details_df, games_df[['GAME_ID', 'HOME_TEAM_WINS']], on=['GAME_ID'], how='left')
# Display the merged dataframe
print(games_details_df.notnull().sum())
GAME_ID 669560 TEAM_ID 669560 TEAM_ABBREVIATION 669560 TEAM_CITY 669560 PLAYER_ID 669560 PLAYER_NAME 669560 NICKNAME 53037 START_POSITION 256104 COMMENT 109881 MINUTES 559678 FIELD_GOALS_MADE 559678 FIELD_GOALS_ATTEMPTED 559678 FIELD_GOAL_PERCENTAGE 559678 THREE_POINTERS_MADE 559678 THREE_POINTERS_ATTEMPTED 559678 THREE_POINTERS_PERCENTAGE 559678 FREE_THROWS_MADE 559678 FREE_THROWS_ATTEMPTED 559678 FREE_THROWS_PERCENTAGE 559678 OFFENSIVE_REBOUND 559678 DEFENSIVE_REBOUND 559678 TOTAL_REBOUNDS 559678 ASSITS 559678 STEALS 559678 BLOCKS 559678 TURNOVERS 559678 PERSONAL_FOULS 559678 TOTAL_POINTS 559678 PLUS_MINUS 536017 HOME_TEAM_WINS 669560 dtype: int64
# Using a heatmap to determine which variables are most strongly related to
# the number of home wins
# Calculate the correlation matrix
correlation_matrix = games_df.corr(numeric_only=True)
# Plotting the heatmap to visualize the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()
# Sorting correlations with respect to HOME_TEAM_WINS
correlation_with_wins = correlation_matrix['HOME_TEAM_WINS'].sort_values(ascending=False)
print("Correlation with HOME_TEAM_WINS:")
print(correlation_with_wins)
Correlation with HOME_TEAM_WINS:
HOME_TEAM_WINS                    1.000000
FIELD_GOALS_PERCENTAGE_HOME       0.432700
POINTS_HOME                       0.395810
ASSISTS_HOME                      0.301750
THREE_POINTERS_PERCENTAGE_HOME    0.301669
TOTAL_REBOUNDS_HOME               0.245884
FREE_THROWS_PERCENTAGE_HOME       0.093394
VISITOR_TEAM_ID                   0.032002
TEAM_ID_AWAY                      0.032002
GAME_ID                           0.030804
TEAM_ID_HOME                     -0.026123
HOME_TEAM_ID                     -0.026123
SEASON                           -0.027849
FREE_THROWS_PERCENTAGE_AWAY      -0.108255
TOTAL_REBOUNDS_AWAY              -0.254746
THREE_POINTERS_PERCENTAGE_AWAY   -0.291838
ASSISTS_AWAY                     -0.295013
POINTS_AWAY                      -0.414452
FIELD_GOALS_PERCENTAGE_AWAY      -0.436201
Name: HOME_TEAM_WINS, dtype: float64
# Drop columns whose correlation with HOME_TEAM_WINS is weak (close to zero)
games_df.drop(columns=['FREE_THROWS_PERCENTAGE_HOME', 'VISITOR_TEAM_ID',
'TEAM_ID_AWAY', 'FREE_THROWS_PERCENTAGE_AWAY',
], inplace=True)
print(games_details_df.dtypes)
GAME_ID int64 TEAM_ID int64 TEAM_ABBREVIATION object TEAM_CITY object PLAYER_ID int64 PLAYER_NAME object NICKNAME object START_POSITION object COMMENT object MINUTES object FIELD_GOALS_MADE float64 FIELD_GOALS_ATTEMPTED float64 FIELD_GOAL_PERCENTAGE float64 THREE_POINTERS_MADE float64 THREE_POINTERS_ATTEMPTED float64 THREE_POINTERS_PERCENTAGE float64 FREE_THROWS_MADE float64 FREE_THROWS_ATTEMPTED float64 FREE_THROWS_PERCENTAGE float64 OFFENSIVE_REBOUND float64 DEFENSIVE_REBOUND float64 TOTAL_REBOUNDS float64 ASSITS float64 STEALS float64 BLOCKS float64 TURNOVERS float64 PERSONAL_FOULS float64 TOTAL_POINTS float64 PLUS_MINUS float64 HOME_TEAM_WINS int64 dtype: object
print(games_df.notnull().sum())
GAME_DATE_EST 26651 GAME_ID 26651 GAME_STATUS_TEXT 26651 HOME_TEAM_ID 26651 SEASON 26651 TEAM_ID_HOME 26651 POINTS_HOME 26552 FIELD_GOALS_PERCENTAGE_HOME 26552 THREE_POINTERS_PERCENTAGE_HOME 26552 ASSISTS_HOME 26552 TOTAL_REBOUNDS_HOME 26552 POINTS_AWAY 26552 FIELD_GOALS_PERCENTAGE_AWAY 26552 THREE_POINTERS_PERCENTAGE_AWAY 26552 ASSISTS_AWAY 26552 TOTAL_REBOUNDS_AWAY 26552 HOME_TEAM_WINS 26651 dtype: int64
# Drop rows missing the HOME_TEAM_WINS target value
games_df.dropna(subset=['HOME_TEAM_WINS'], inplace=True)
print(games_df.notnull().sum())
GAME_DATE_EST 26651 GAME_ID 26651 GAME_STATUS_TEXT 26651 HOME_TEAM_ID 26651 SEASON 26651 TEAM_ID_HOME 26651 POINTS_HOME 26552 FIELD_GOALS_PERCENTAGE_HOME 26552 THREE_POINTERS_PERCENTAGE_HOME 26552 ASSISTS_HOME 26552 TOTAL_REBOUNDS_HOME 26552 POINTS_AWAY 26552 FIELD_GOALS_PERCENTAGE_AWAY 26552 THREE_POINTERS_PERCENTAGE_AWAY 26552 ASSISTS_AWAY 26552 TOTAL_REBOUNDS_AWAY 26552 HOME_TEAM_WINS 26651 dtype: int64
#Dealing with game_df
# Calculate the correlation between each independent variable and the dependent variable
correlation = game_df.corr(numeric_only=True)['HOME_TEAM_WINS'].abs().sort_values(ascending=False)
print(correlation)
HOME_TEAM_WINS                    1.000000
TOTAL_POINTS_AWAY                 0.800786
PLUS_MINUS_HOME                   0.800786
FIELD_GOALS_ATTEMPTED_AWAY        0.435969
FIELD_GOALS_PERCENTAGE_HOME       0.432477
PERSONAL_FOULS_AWAY               0.412932
TOTAL_POINTS_HOME                 0.395754
MINUTES.1                         0.346459
OFFENSIVE_REBOUND_AWAY            0.338537
FIELD_GOALS_MADE_HOME             0.334366
DEFENSIVE_REBOUND_HOME            0.311751
THREE_POINTERS_PERCENTAGE_HOME    0.303110
ASSISTS_HOME                      0.303104
THREE_POINTERS_ATTEMPTED_AWAY     0.293744
TOTAL_REBOUNDS_AWAY               0.293129
DEFENSIVE_REBOUND_AWAY            0.255121
TOTAL_REBOUNDS_HOME               0.249084
FIELD_GOALS_PERCENTAGE_AWAY       0.209125
THREE_POINTERS_HOME               0.159393
BLOCKS_HOME                       0.154001
THREE_POINTERS_PERCENTAGE_AWAY    0.152171
STEALS_HOME                       0.149421
FREE_THROWS_MADE_HOME             0.146339
STEALS_AWAY                       0.133931
ASSISTS_AWAY                      0.122795
FREE_THROWS_ATTEMPTED_HOME        0.117824
FREE_THROWS_MADE_AWAY             0.117455
BLOCKS_AWAY                       0.116362
FREE_THROWS_ATTEMPTED_AWAY        0.112495
PERSONAL_FOULS_HOME               0.110610
TURNOVERS_AWAY                    0.099950
FREE_THROWS_PERCENTAGE_HOME       0.094541
TURNOVERS_HOME                    0.086109
FIELD_GOALS_ATTEMPTED_HOME        0.053229
FREE_THROWS_PERCENTAGE_AWAY       0.051830
THREE_POINTERS_WAY                0.041980
FIELD_GOALS_MADE_AWAY             0.032502
TEAM_ID_AWAY                      0.029386
MINUTES                           0.028381
PLUS_MINUS_AWAY                   0.027812
TEAM_ID_HOME                      0.027183
THREE_POINTERS_ATTEMPTED_HOME     0.026635
SEASON_ID                         0.022134
OFFENSIVE_REBOUND_HOME            0.021336
GAME_ID                           0.018529
Name: HOME_TEAM_WINS, dtype: float64
game_df.columns.tolist()
['SEASON_ID', 'TEAM_ID_HOME', 'TEAM_ABBREVIATION_HOME', 'TEAM_NAME_HOME', 'GAME_ID', 'GAME_DATE', 'MATCHUP_HOME', 'WIN_LOSS_HOME', 'MINUTES', 'FIELD_GOALS_MADE_HOME', 'FIELD_GOALS_ATTEMPTED_HOME', 'FIELD_GOALS_PERCENTAGE_HOME', 'THREE_POINTERS_HOME', 'THREE_POINTERS_ATTEMPTED_HOME', 'THREE_POINTERS_PERCENTAGE_HOME', 'FREE_THROWS_MADE_HOME', 'FREE_THROWS_ATTEMPTED_HOME', 'FREE_THROWS_PERCENTAGE_HOME', 'OFFENSIVE_REBOUND_HOME', 'DEFENSIVE_REBOUND_HOME', 'TOTAL_REBOUNDS_HOME', 'ASSISTS_HOME', 'STEALS_HOME', 'BLOCKS_HOME', 'TURNOVERS_HOME', 'PERSONAL_FOULS_HOME', 'TOTAL_POINTS_HOME', 'PLUS_MINUS_HOME', 'TEAM_ID_AWAY', 'TEAM_ABBREVIATION_AWAY', 'TEAM_NAME_AWAY', 'MATCHUP_AWAY', 'WIN_LOSS_AWAY', 'MINUTES.1', 'FIELD_GOALS_MADE_AWAY', 'FIELD_GOALS_ATTEMPTED_AWAY', 'FIELD_GOALS_PERCENTAGE_AWAY', 'THREE_POINTERS_WAY', 'THREE_POINTERS_ATTEMPTED_AWAY', 'THREE_POINTERS_PERCENTAGE_AWAY', 'FREE_THROWS_MADE_AWAY', 'FREE_THROWS_ATTEMPTED_AWAY', 'FREE_THROWS_PERCENTAGE_AWAY', 'OFFENSIVE_REBOUND_AWAY', 'DEFENSIVE_REBOUND_AWAY', 'TOTAL_REBOUNDS_AWAY', 'ASSISTS_AWAY', 'STEALS_AWAY', 'BLOCKS_AWAY', 'TURNOVERS_AWAY', 'PERSONAL_FOULS_AWAY', 'TOTAL_POINTS_AWAY', 'PLUS_MINUS_AWAY', 'SEASON_TYPE', 'HOME_TEAM_WINS']
Finding the correlation when a home win is coded as positive (W). We do this because game_df has far more rows with WIN_LOSS_HOME populated than with HOME_TEAM_WINS, so relying on the HOME_TEAM_WINS correlation alone could be far from accurate.
# Replace 'W' with 1 and 'L' with 0 in the 'WIN_LOSS_HOME' column
game_df['WIN_LOSS_HOME'] = game_df['WIN_LOSS_HOME'].replace({'W': 1, 'L': 0})
# Drop rows where 'WIN_LOSS_HOME' is NaN; copy to avoid a SettingWithCopyWarning
game_df = game_df.dropna(subset=['WIN_LOSS_HOME']).copy()
# Convert the column to integer type
game_df['WIN_LOSS_HOME'] = game_df['WIN_LOSS_HOME'].astype(int)
# Calculate the correlation between 'WIN_LOSS_HOME' and each other column
#correlation_with_win = game_df.corr()['WIN_LOSS_HOME'].abs().sort_values(ascending=False)
# Display the correlation coefficients
game_df.head(10)
| SEASON_ID | TEAM_ID_HOME | TEAM_ABBREVIATION_HOME | TEAM_NAME_HOME | GAME_ID | GAME_DATE | MATCHUP_HOME | WIN_LOSS_HOME | MINUTES | FIELD_GOALS_MADE_HOME | ... | TOTAL_REBOUNDS_AWAY | ASSISTS_AWAY | STEALS_AWAY | BLOCKS_AWAY | TURNOVERS_AWAY | PERSONAL_FOULS_AWAY | TOTAL_POINTS_AWAY | PLUS_MINUS_AWAY | SEASON_TYPE | HOME_TEAM_WINS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21946 | 1610610035 | HUS | Toronto Huskies | 24600001 | 11/1/1946 0:00 | HUS vs. NYK | 0 | 0 | 25.0 | ... | NaN | NaN | NaN | NaN | NaN | 68 | 2 | 0 | Regular Season | NaN |
| 1 | 21946 | 1610610034 | BOM | St. Louis Bombers | 24600003 | 11/2/1946 0:00 | BOM vs. PIT | 1 | 0 | 20.0 | ... | NaN | NaN | NaN | NaN | 25.0 | 51 | -5 | 0 | Regular Season | NaN |
| 2 | 21946 | 1610610032 | PRO | Providence Steamrollers | 24600002 | 11/2/1946 0:00 | PRO vs. BOS | 1 | 0 | 21.0 | ... | NaN | NaN | NaN | NaN | NaN | 53 | -6 | 0 | Regular Season | NaN |
| 3 | 21946 | 1610610025 | CHS | Chicago Stags | 24600004 | 11/2/1946 0:00 | CHS vs. NYK | 1 | 0 | 21.0 | ... | NaN | NaN | NaN | NaN | 22.0 | 47 | -16 | 0 | Regular Season | NaN |
| 4 | 21946 | 1610610028 | DEF | Detroit Falcons | 24600005 | 11/2/1946 0:00 | DEF vs. WAS | 0 | 0 | 10.0 | ... | NaN | NaN | NaN | NaN | NaN | 50 | 17 | 0 | Regular Season | NaN |
| 5 | 21946 | 1610610026 | CLR | Cleveland Rebels | 24600006 | 11/3/1946 0:00 | CLR vs. HUS | 1 | 0 | 24.0 | ... | NaN | NaN | NaN | NaN | NaN | 60 | -11 | 0 | Regular Season | NaN |
| 6 | 21946 | 1610610031 | PIT | Pittsburgh Ironmen | 24600007 | 11/4/1946 0:00 | PIT vs. WAS | 0 | 0 | 19.0 | ... | NaN | NaN | NaN | NaN | NaN | 71 | 15 | 0 | Regular Season | NaN |
| 7 | 21946 | 1610612738 | BOS | Boston Celtics | 24600008 | 11/5/1946 0:00 | BOS vs. CHS | 0 | 0 | 23.0 | ... | NaN | NaN | NaN | NaN | NaN | 57 | 2 | 0 | Regular Season | NaN |
| 8 | 21946 | 1610610028 | DEF | Detroit Falcons | 24600009 | 11/5/1946 0:00 | DEF vs. BOM | 0 | 0 | 18.0 | ... | NaN | NaN | NaN | NaN | NaN | 53 | 4 | 0 | Regular Season | NaN |
| 9 | 21946 | 1610610032 | PRO | Providence Steamrollers | 24600011 | 11/7/1946 0:00 | PRO vs. CHS | 1 | 0 | 31.0 | ... | NaN | NaN | NaN | NaN | 14.0 | 65 | -8 | 0 | Regular Season | NaN |
10 rows × 55 columns
# Calculate the correlation between 'WIN_LOSS_HOME' and each other column
correlation_with_win = game_df.corr(numeric_only=True)['WIN_LOSS_HOME'].abs().sort_values(ascending=False)
# Display the correlation coefficients
print(correlation_with_win)
WIN_LOSS_HOME                     1.000000
HOME_TEAM_WINS                    0.999917
TOTAL_POINTS_AWAY                 0.791378
PLUS_MINUS_HOME                   0.791378
FIELD_GOALS_PERCENTAGE_HOME       0.427858
FIELD_GOALS_ATTEMPTED_AWAY        0.397353
TOTAL_POINTS_HOME                 0.360849
PERSONAL_FOULS_AWAY               0.349415
OFFENSIVE_REBOUND_AWAY            0.342376
ASSISTS_HOME                      0.309796
DEFENSIVE_REBOUND_HOME            0.305064
FIELD_GOALS_MADE_HOME             0.287747
MINUTES.1                         0.267723
TOTAL_REBOUNDS_AWAY               0.258721
DEFENSIVE_REBOUND_AWAY            0.231091
THREE_POINTERS_ATTEMPTED_AWAY     0.230134
TOTAL_REBOUNDS_HOME               0.230132
THREE_POINTERS_PERCENTAGE_HOME    0.215919
FREE_THROWS_MADE_HOME             0.186840
STEALS_HOME                       0.164890
BLOCKS_HOME                       0.156968
FREE_THROWS_ATTEMPTED_HOME        0.155304
FIELD_GOALS_PERCENTAGE_AWAY       0.149945
THREE_POINTERS_PERCENTAGE_AWAY    0.135328
STEALS_AWAY                       0.135079
TURNOVERS_AWAY                    0.133878
BLOCKS_AWAY                       0.133513
ASSISTS_AWAY                      0.113065
FREE_THROWS_ATTEMPTED_AWAY        0.109990
FREE_THROWS_MADE_AWAY             0.109458
FREE_THROWS_PERCENTAGE_HOME       0.101418
PERSONAL_FOULS_HOME               0.093957
TURNOVERS_HOME                    0.073045
FREE_THROWS_PERCENTAGE_AWAY       0.071782
THREE_POINTERS_ATTEMPTED_HOME     0.066632
FIELD_GOALS_MADE_AWAY             0.064048
THREE_POINTERS_HOME               0.055654
MINUTES                           0.054040
PLUS_MINUS_AWAY                   0.046754
FIELD_GOALS_ATTEMPTED_HOME        0.041630
THREE_POINTERS_WAY                0.036150
GAME_ID                           0.033467
TEAM_ID_AWAY                      0.021365
OFFENSIVE_REBOUND_HOME            0.019098
SEASON_ID                         0.014567
TEAM_ID_HOME                      0.014155
Name: WIN_LOSS_HOME, dtype: float64
#Drop columns that might not help us as much
game_df.drop(columns=['OFFENSIVE_REBOUND_HOME', 'TEAM_ID_HOME',
'TEAM_ID_AWAY', 'THREE_POINTERS_WAY',
'FIELD_GOALS_ATTEMPTED_HOME', 'PLUS_MINUS_AWAY',
'MINUTES', 'THREE_POINTERS_HOME', 'FIELD_GOALS_MADE_AWAY',
'THREE_POINTERS_ATTEMPTED_HOME', 'FREE_THROWS_PERCENTAGE_AWAY',
'TURNOVERS_HOME', 'PERSONAL_FOULS_HOME', 'FREE_THROWS_PERCENTAGE_HOME',
'FREE_THROWS_MADE_HOME', 'STEALS_HOME', 'BLOCKS_HOME',
'FREE_THROWS_ATTEMPTED_HOME', 'FIELD_GOALS_PERCENTAGE_AWAY',
'THREE_POINTERS_PERCENTAGE_AWAY', 'STEALS_AWAY', 'TURNOVERS_AWAY',
'BLOCKS_AWAY', 'ASSISTS_AWAY', 'FREE_THROWS_ATTEMPTED_AWAY',
'FREE_THROWS_MADE_AWAY'], inplace=True)
print(game_df.notnull().sum())
SEASON_ID 65725 TEAM_ABBREVIATION_HOME 65725 TEAM_NAME_HOME 65725 GAME_ID 65725 GAME_DATE 65725 MATCHUP_HOME 65725 WIN_LOSS_HOME 65725 FIELD_GOALS_MADE_HOME 65712 FIELD_GOALS_PERCENTAGE_HOME 50235 THREE_POINTERS_PERCENTAGE_HOME 46651 DEFENSIVE_REBOUND_HOME 46726 TOTAL_REBOUNDS_HOME 49996 ASSISTS_HOME 49920 TOTAL_POINTS_HOME 65725 PLUS_MINUS_HOME 65725 TEAM_ABBREVIATION_AWAY 65725 TEAM_NAME_AWAY 65725 MATCHUP_AWAY 65725 WIN_LOSS_AWAY 65725 MINUTES.1 65712 FIELD_GOALS_ATTEMPTED_AWAY 50236 THREE_POINTERS_ATTEMPTED_AWAY 46763 OFFENSIVE_REBOUND_AWAY 46727 DEFENSIVE_REBOUND_AWAY 50000 TOTAL_REBOUNDS_AWAY 49924 PERSONAL_FOULS_AWAY 65725 TOTAL_POINTS_AWAY 65725 SEASON_TYPE 65725 HOME_TEAM_WINS 24984 dtype: int64
# Impute missing values using the mean within each group of teams sharing the same name
# Define the columns to impute
columns_to_impute = ['FIELD_GOALS_PERCENTAGE_HOME', 'THREE_POINTERS_PERCENTAGE_HOME',
'DEFENSIVE_REBOUND_HOME', 'TOTAL_REBOUNDS_HOME', 'ASSISTS_HOME',
'MINUTES.1', 'FIELD_GOALS_ATTEMPTED_AWAY', 'THREE_POINTERS_ATTEMPTED_AWAY',
'OFFENSIVE_REBOUND_AWAY', 'DEFENSIVE_REBOUND_AWAY', 'TOTAL_REBOUNDS_AWAY']
# Initialize the imputer
imputer = SimpleImputer(strategy='mean')
# Iterate over each column to impute
for column in columns_to_impute:
# Group by 'TEAM' and apply the imputer to each group
game_df[column] = game_df.groupby('TEAM_NAME_HOME')[column].transform(
lambda x: imputer.fit_transform(x.values.reshape(-1, 1)).ravel() if x.notnull().any() else x
)
# Print the count of non-null values after imputation
print(game_df.notnull().sum())
SEASON_ID 65725 TEAM_ABBREVIATION_HOME 65725 TEAM_NAME_HOME 65725 GAME_ID 65725 GAME_DATE 65725 MATCHUP_HOME 65725 WIN_LOSS_HOME 65725 FIELD_GOALS_MADE_HOME 65712 FIELD_GOALS_PERCENTAGE_HOME 64536 THREE_POINTERS_PERCENTAGE_HOME 60100 DEFENSIVE_REBOUND_HOME 59761 TOTAL_REBOUNDS_HOME 63651 ASSISTS_HOME 63618 TOTAL_POINTS_HOME 65725 PLUS_MINUS_HOME 65725 TEAM_ABBREVIATION_AWAY 65725 TEAM_NAME_AWAY 65725 MATCHUP_AWAY 65725 WIN_LOSS_AWAY 65725 MINUTES.1 65725 FIELD_GOALS_ATTEMPTED_AWAY 64536 THREE_POINTERS_ATTEMPTED_AWAY 60100 OFFENSIVE_REBOUND_AWAY 59761 DEFENSIVE_REBOUND_AWAY 63651 TOTAL_REBOUNDS_AWAY 63618 PERSONAL_FOULS_AWAY 65725 TOTAL_POINTS_AWAY 65725 SEASON_TYPE 65725 HOME_TEAM_WINS 24984 dtype: int64
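The per-group SimpleImputer loop above works, but the same group-wise mean fill can be expressed more directly with a grouped `fillna`. A minimal sketch on a toy frame (the team names and values below are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for game_df: per-team home stats with gaps
df = pd.DataFrame({
    "TEAM_NAME_HOME": ["Celtics", "Celtics", "Lakers", "Lakers"],
    "ASSISTS_HOME":   [20.0, np.nan, 30.0, np.nan],
})

# Fill each team's missing values with that team's own mean
df["ASSISTS_HOME"] = (
    df.groupby("TEAM_NAME_HOME")["ASSISTS_HOME"]
      .transform(lambda s: s.fillna(s.mean()))
)
print(df["ASSISTS_HOME"].tolist())
```

Groups that are entirely NaN simply stay NaN, matching the `if x.notnull().any()` guard in the loop above.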
# Drop all rows where key columns are still null after imputation.
# The HOME_TEAM_WINS column is kept for now, though it might not be very beneficial.
# Define the list of columns to check for null values
columns_to_check = ['OFFENSIVE_REBOUND_AWAY', 'DEFENSIVE_REBOUND_HOME',
'FIELD_GOALS_PERCENTAGE_HOME']
# Drop rows where any of the specified columns have null values
game_df.dropna(subset=columns_to_check, inplace=True)
print(game_df.notnull().sum())
SEASON_ID 59761 TEAM_ABBREVIATION_HOME 59761 TEAM_NAME_HOME 59761 GAME_ID 59761 GAME_DATE 59761 MATCHUP_HOME 59761 WIN_LOSS_HOME 59761 FIELD_GOALS_MADE_HOME 59760 FIELD_GOALS_PERCENTAGE_HOME 59761 THREE_POINTERS_PERCENTAGE_HOME 59761 DEFENSIVE_REBOUND_HOME 59761 TOTAL_REBOUNDS_HOME 59761 ASSISTS_HOME 59761 TOTAL_POINTS_HOME 59761 PLUS_MINUS_HOME 59761 TEAM_ABBREVIATION_AWAY 59761 TEAM_NAME_AWAY 59761 MATCHUP_AWAY 59761 WIN_LOSS_AWAY 59761 MINUTES.1 59761 FIELD_GOALS_ATTEMPTED_AWAY 59761 THREE_POINTERS_ATTEMPTED_AWAY 59761 OFFENSIVE_REBOUND_AWAY 59761 DEFENSIVE_REBOUND_AWAY 59761 TOTAL_REBOUNDS_AWAY 59761 PERSONAL_FOULS_AWAY 59761 TOTAL_POINTS_AWAY 59761 SEASON_TYPE 59761 HOME_TEAM_WINS 24984 dtype: int64
We will now move on to Data Exploration and Summary Statistics.
Creating a Composite Score and Utilizing Linear Regression for NBA Championship Prediction
In the pursuit of accurately predicting the next NBA championship winner, a crucial step involved the creation of a composite score derived from columns that exhibited the highest correlation with a team's success. This approach aimed to distill the multifaceted dynamics of basketball into a singular metric, facilitating a more comprehensive assessment of team performance. By amalgamating key indicators such as points per game, field goal percentage, defensive rebounds, and turnovers, among others, into a unified composite score, the model could encapsulate the essence of team proficiency across various facets of the game.
The rationale behind employing linear regression with this newly constructed composite score lies in its ability to delineate and quantify the relationship between predictor variables and the target variable – in this case, the likelihood of a team clinching the NBA championship. Linear regression offered a structured framework to analyze how changes in the composite score corresponded to variations in championship success probabilities. Leveraging this statistical technique allowed for the identification of patterns and trends within the data, enabling more informed predictions regarding potential championship outcomes. Moreover, the simplicity and interpretability of linear regression facilitated clear insights into the relative significance of different components comprising the composite score, aiding in the refinement and optimization of the prediction model.
In essence, the integration of a composite score and the application of linear regression represent a strategic approach to enhance the predictive efficacy of the NBA championship forecasting model. By amalgamating relevant metrics and employing a robust statistical methodology, the endeavor seeks to unravel the intricacies of basketball dynamics and furnish valuable insights into the determinants of championship success.
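Concretely, the composite score for team i is the sum over features j of corr(x_j, W) times the standardized (z-scored) value of feature j for team i. A minimal synthetic sketch of this correlation-weighted construction (the feature names and numbers below are invented for illustration):

```python
import pandas as pd

# Toy frame: two features and a win column, chosen so points help and turnovers hurt
toy = pd.DataFrame({
    "PPG": [100.0, 105.0, 110.0, 115.0],   # points per game (positively related to W)
    "TOV": [16.0, 15.0, 14.0, 13.0],       # turnovers (negatively related to W)
    "W":   [30, 40, 50, 60],
})
features = toy[["PPG", "TOV"]]

# Direction-aware weights: each feature's correlation with wins
weights = features.apply(lambda col: col.corr(toy["W"]))

# Standardize, then take the correlation-weighted sum per row
zscores = (features - features.mean()) / features.std()
composite = zscores.multiply(weights, axis=1).sum(axis=1)
print(composite.round(3).tolist())
```

Because TOV carries a negative weight, low-turnover teams are rewarded rather than penalized, which is why weighting by the signed correlation matters.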
Note: For all linear regressions, the Null Hypothesis (H0) is that the coefficient of the 'Composite Score' is equal to zero, i.e., there is no linear relationship between the 'Composite Score' and the number of wins ('W').
# Select only numeric columns, excluding 'W' and 'L' for correlation calculation
numeric_cols = merged_team_df.select_dtypes(include=np.number).drop(['W', 'L'], axis=1)
# Calculate correlation with 'W'
correlation_with_w = numeric_cols.apply(lambda x: x.corr(merged_team_df['W']))
# Standardize numeric variables
normalized_variables = (numeric_cols - numeric_cols.mean()) / numeric_cols.std()
# Calculate the composite score, taking into account the direction of correlation
merged_team_df['Composite Score'] = normalized_variables.multiply(correlation_with_w, axis=1).sum(axis=1)
merged_team_df['Composite Score']
0 -7.188061
1 9.512361
2 -9.151929
3 -6.050632
4 -19.029235
...
1600 -21.047678
1601 -23.873765
1602 -29.726828
1603 -23.842757
1604 -13.805353
Name: Composite Score, Length: 1605, dtype: float64
We will now use linear regression to understand the relationship between the 'Composite Score' and the number of wins in the 'W' column.
X = merged_team_df['Composite Score'] # Independent variable
y = merged_team_df['W'] # Dependent variable
# Add a constant term to the independent variable. This is necessary for
# estimating the intercept (the value of 'y' when 'x' is zero) in the linear regression model.
X = sm.add_constant(X)
#fit the linear regression model using the OLS (ordinary least squares) method from statsmodels.
model = sm.OLS(y, X).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: W R-squared: 0.850
Model: OLS Adj. R-squared: 0.850
Method: Least Squares F-statistic: 9071.
Date: Wed, 08 May 2024 Prob (F-statistic): 0.00
Time: 03:10:46 Log-Likelihood: -4819.4
No. Observations: 1605 AIC: 9643.
Df Residuals: 1603 BIC: 9654.
Df Model: 1
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
const 39.9745 0.122 328.409 0.000 39.736 40.213
Composite Score 1.3485 0.014 95.240 0.000 1.321 1.376
==============================================================================
Omnibus: 93.333 Durbin-Watson: 1.140
Prob(Omnibus): 0.000 Jarque-Bera (JB): 168.512
Skew: 0.426 Prob(JB): 2.56e-37
Kurtosis: 4.339 Cond. No. 8.60
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The new linear regression analysis with the updated combined metric ('Composite Score') yielded the following results:
R-squared (R²): The R-squared value is 0.850, indicating that approximately 85.0% of the variance in the number of wins ('W') can be explained by the 'Composite Score'. This suggests a strong relationship between the combined metric and team success.
Coefficient (coef) and its p-value: The coefficient of the 'Composite Score' is 1.3485, indicating that for each one-unit increase in the 'Composite Score', there is an expected increase of approximately 1.3485 units in the number of wins ('W'). The p-value associated with this coefficient is very small (p < 0.001), indicating that the coefficient is statistically significant.
Intercept (const) and its p-value: The intercept term is 39.9745, representing the expected number of wins when the 'Composite Score' is zero. The intercept is also statistically significant with a p-value of 0.000.
F-statistic and its p-value: The F-statistic tests the overall significance of the regression model. The probability associated with the F-statistic is 0.00, indicating that the regression model as a whole is statistically significant.
The results of the linear regression analysis indicate that the 'Composite Score' is a highly significant predictor of the number of wins ('W'). The strong R-squared value suggests that the 'Composite Score' explains a substantial portion of the variability in team success. Each unit increase in the 'Composite Score' is associated with a significant increase in the number of wins.
Therefore, based on this analysis, it can be concluded that the 'Composite Score' derived from the combination of metrics provides a robust and effective measure of team performance, with a strong predictive relationship with the number of wins. This underscores the importance of considering multiple performance indicators and refining the combined metric to better capture the factors driving team success.
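As a worked illustration of the fitted line, plugging a few composite scores into W_hat = const + coef × score, with the coefficients taken from the OLS summary above:

```python
# Coefficients taken from the fitted OLS summary above
const, coef = 39.9745, 1.3485

def predicted_wins(score):
    # W_hat = intercept + slope * Composite Score
    return const + coef * score

for score in (-10.0, 0.0, 10.0):
    print(f"Composite Score {score:+.1f} -> predicted wins {predicted_wins(score):.1f}")
```

So an average team (score near 0) projects to roughly 40 wins, and each ten-point swing in the composite score moves the projection by about 13.5 wins.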
# Scatter plot with regression line
plt.figure(figsize=(10, 6))
sns.scatterplot(data=merged_team_df, x='Composite Score', y='W', color='blue', alpha=0.5)
sns.regplot(data=merged_team_df, x='Composite Score', y='W', scatter=False, color='red', line_kws={'linewidth': 2})
plt.title('Relationship between Composite Score and Number of Wins')
plt.xlabel('Composite Score')
plt.ylabel('Number of Wins')
plt.grid(True)
plt.show()
We will now compute descriptive statistics for the 'TOTAL_POINTS_HOME' column of game_df.
column_name = 'TOTAL_POINTS_HOME'
# Calculate descriptive statistics
mean_value = game_df[column_name].mean()
median_value = game_df[column_name].median()
std_dev = game_df[column_name].std()
min_value = game_df[column_name].min()
max_value = game_df[column_name].max()
# Print descriptive statistics
print("Descriptive Statistics for", column_name)
print("Mean:", mean_value)
print("Median:", median_value)
print("Standard Deviation:", std_dev)
print("Minimum:", min_value)
print("Maximum:", max_value)
Descriptive Statistics for TOTAL_POINTS_HOME
Mean: 105.23513662756648
Median: 105.0
Standard Deviation: 14.060249885493763
Minimum: 45
Maximum: 192
Based on the descriptive statistics for the variable 'TOTAL_POINTS_HOME' in the game_df dataframe:
Mean (Average): The mean total points scored by the home teams is approximately 105.23 points per game. This value represents the central tendency of the data and gives an indication of the typical scoring performance of home teams.
Median: The median total points scored by the home teams is 105.0 points per game. Since the median is close to the mean, it suggests that the distribution of total points scored by home teams is approximately symmetrical.
Standard Deviation: The standard deviation of approximately 14.06 indicates the average amount of variability or dispersion in the total points scored by home teams around the mean. A higher standard deviation suggests greater variability in scoring performance among home teams.
Minimum and Maximum: The minimum total points scored by a home team in a game is 45, while the maximum total points scored is 192. This range of 147 points reflects the variability in scoring performances observed across different games.
Conclusion: Based on these descriptive statistics, we can conclude that home teams in the dataset tend to score an average of around 105 points per game, with a moderate level of variability around this average. The distribution of total points scored appears to be relatively symmetrical, as indicated by the similarity between the mean and median values. However, the variability in scoring performances among home teams, as evidenced by the standard deviation and range, suggests that some games may have significantly higher or lower scoring outcomes compared to the average.
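The statistics above can be wrapped in a small helper so the same summary is reproducible for any numeric column. A minimal sketch, using a short hypothetical series (echoing the observed min, median, and max) rather than the real game_df:

```python
import pandas as pd

def describe_column(s: pd.Series) -> dict:
    """Collect the summary statistics reported above for any numeric Series."""
    return {
        "mean": s.mean(),
        "median": s.median(),
        "std": s.std(),
        "min": s.min(),
        "max": s.max(),
        "range": s.max() - s.min(),
    }

# Hypothetical points values echoing the observed min (45), median (105), and max (192)
points = pd.Series([45, 95, 100, 105, 105, 110, 115, 192])
stats_ = describe_column(points)
print(stats_)
```

On the real data, the same helper would be called as `describe_column(game_df['TOTAL_POINTS_HOME'])`.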
total_points_home = game_df['TOTAL_POINTS_HOME']
# Plot histogram
plt.figure(figsize=(10, 6))
plt.hist(total_points_home, bins=20, color='skyblue', edgecolor='black', alpha=0.7)
# Add mean, median, and standard deviation error bars
mean_value = total_points_home.mean()
median_value = total_points_home.median()
std_dev = total_points_home.std()
plt.axvline(mean_value, color='red', linestyle='dashed', linewidth=1, label=f'Mean: {mean_value:.2f}')
plt.axvline(median_value, color='green', linestyle='dashed', linewidth=1, label=f'Median: {median_value:.2f}')
plt.errorbar(mean_value, 500, xerr=std_dev, fmt='o', color='red', label=f'Standard Deviation: {std_dev:.2f}')  # y=500 is an arbitrary height chosen to place the marker inside the histogram
# Add labels and title
plt.xlabel('Total Points Scored by Home Teams')
plt.ylabel('Frequency')
plt.title('Distribution of Total Points Scored by Home Teams')
plt.legend()
# Show plot
plt.grid(True)
plt.show()
In the competitive world of basketball, understanding the factors that contribute to a team’s success is crucial. One such factor that often comes under scrutiny is the efficiency rating of players. The player efficiency rating is a comprehensive metric that encapsulates a player’s contributions to the game, both positive and negative. In this report, we will delve into an analysis of how the player efficiency rating impacts a team’s wins. We will test the hypothesis that the player efficiency rating has a significant effect on a team’s wins. Our analysis will be based on data from various teams and players, and we will use statistical methods to draw meaningful conclusions. Let’s dive into the details.
The Efficiency Rating is a comprehensive metric that attempts to measure a player’s productivity or efficiency in the game. Here’s how it works:
PTS: points scored. ORB: offensive rebounds. DRB: defensive rebounds. STL: steals. BLK: blocks. FGmissed: field goals missed (attempts minus makes). FTmissed: free throws missed (attempts minus makes). TO: turnovers. These statistics belong to whichever player (or, in the later section, opponent team) is being rated.
The Efficiency Rating is calculated using the formula: Efficiency Rating = (PTS + 3*ORB + 2*DRB + STL + 1.5*BLK - FGmissed - FTmissed - TO) / MINUTES_PLAYED. The formula weighs the positive contributions (points, rebounds, steals, blocks) against the negative ones (missed shots, turnovers), and the result is normalized by the total minutes played. A higher Efficiency Rating indicates a more productive or efficient performance. By comparing the Efficiency Ratings of different players or teams, you can gain insights into their relative strengths and weaknesses, which is particularly useful in strategic planning and decision-making for future games. For example, if an opposing team has a high Efficiency Rating, it might be beneficial to study its strategies and adapt your team's defense accordingly; a low Efficiency Rating might indicate areas another team can exploit.
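As a self-contained check of the formula, here is a pure-Python version applied to one hypothetical stat line (the numbers are invented for illustration):

```python
def efficiency_rating(pts, orb, drb, stl, blk, fg_missed, ft_missed, to, minutes):
    """Weighted box-score contributions per minute, following the formula above."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return (pts + 3 * orb + 2 * drb + stl + 1.5 * blk
            - fg_missed - ft_missed - to) / minutes

# Hypothetical stat line: 20 PTS, 2 ORB, 5 DRB, 1 STL, 2 BLK,
# 8 missed field goals, 2 missed free throws, 3 turnovers in 34 minutes
rating = efficiency_rating(20, 2, 5, 1, 2, 8, 2, 3, 34)
print(round(rating, 3))  # (20 + 6 + 10 + 1 + 3 - 8 - 2 - 3) / 34 -> 0.794
```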
player_df['Points'] = player_df['TOTAL_POINTS']
player_df['Offensive Rebounds'] = player_df['OFFENSIVE_REBOUND']
player_df['Defensive Rebounds'] = player_df['DEFENSIVE_REBOUND']
player_df['Steals'] = player_df['STEALS']
player_df['Blocks'] = player_df['BLOCK']
player_df['Field Goals Missed'] = player_df['FIELD_GOAL_ATTEMPTED'] - player_df['FIELD_GOALS_MADE']
player_df['Free Throws Missed'] = player_df['FREE_THROW_ATTEMPTED'] - player_df['FREE_THROWS_MADE']
player_df['Turnovers'] = player_df['TURNOVER']
player_df['Player Efficiency Rating'] = (player_df['Points'] + 3 * player_df['Offensive Rebounds'] + 2 * player_df['Defensive Rebounds'] + player_df['Steals'] + 1.5 * player_df['Blocks'] - player_df['Field Goals Missed'] - player_df['Free Throws Missed'] - player_df['Turnovers']) / player_df['MINUTES_PLAYED']
print(player_df['Player Efficiency Rating'])
0 0.510448
1 0.563043
2 0.238636
3 0.763344
4 0.446878
...
31782 NaN
31783 NaN
31784 NaN
31785 NaN
31786 NaN
Name: Player Efficiency Rating, Length: 31787, dtype: float64
player_df['Player Efficiency Rating'].isna().sum()
6681
player_df.columns.to_list()
['SEASON', 'PLAYER_ID', 'PLAYER', 'AGE', 'EXPERIENCE', 'TEAM_ABBREVIATION', 'GAMES', 'GAMES_STARTED', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE', 'FIELD_GOAL_ATTEMPTED', 'FIELD_GOAL_PERCENTAGE', 'THREE_POINTERS_MADE', 'THREE_POINTERS_PERCENTAGE', 'TWO_POINTERS_MADE', 'TWO_POINTERS_PERCENTAGE', 'EFFECTIVE_FIELD_GOAL_PERCENTAGE', 'FREE_THROWS_MADE', 'FREE_THROW_ATTEMPTED', 'FREE_THROW_PERCENTAGE', 'OFFENSIVE_REBOUND', 'DEFENSIVE_REBOUND', 'TOTAL_REBOUND', 'ASSISTS', 'STEALS', 'BLOCK', 'TURNOVER', 'PERSONAL_FOUL', 'TOTAL_POINTS', 'TEAM', 'OFFENSIVE_RATING', 'DEFENSIVE RATING', 'Points', 'Offensive Rebounds', 'Defensive Rebounds', 'Steals', 'Blocks', 'Field Goals Missed', 'Free Throws Missed', 'Turnovers', 'Player Efficiency Rating']
player_df = player_df.dropna(subset=['Player Efficiency Rating'])
player_df['Player Efficiency Rating'].isna().sum()
0
Calculating the efficiency rating of all opponents against each team
opponent_team_df['PTS'] = opponent_team_df['OPPONENTS_POINTS']
opponent_team_df['ORB'] = opponent_team_df['OPPONENTS_OFFENSIVE_REBOUND']
opponent_team_df['DRB'] = opponent_team_df['OPPONENTS_DEFENSIVE_REBOUND']
opponent_team_df['STL'] = opponent_team_df['OPPONENTS_STEALS']
opponent_team_df['BLK'] = opponent_team_df['OPPONENTS_BLOCKS']
opponent_team_df['FGmissed'] = opponent_team_df['OPPONENT_FIELD_GOALS_ATTEMPTED'] - opponent_team_df['OPPONENT_FIELD_GOALS']
opponent_team_df['FTmissed'] = opponent_team_df['OPPONENT_FREE_THROWS_ATTEMPTED'] - opponent_team_df['OPPONENT_FREE_THROWS']
opponent_team_df['TO'] = opponent_team_df['OPPONENTS_TURNOVERS']
opponent_team_df['Efficiency Rating'] = (opponent_team_df['PTS'] + 3*opponent_team_df['ORB'] + 2*opponent_team_df['DRB'] + opponent_team_df['STL'] + 1.5*opponent_team_df['BLK'] - opponent_team_df['FGmissed'] - opponent_team_df['FTmissed'] - opponent_team_df['TO']) / opponent_team_df['MINUTES_PLAYED']
print(opponent_team_df['Efficiency Rating'])
0 0.727857
1 0.637111
2 0.686958
3 0.652687
4 0.714028
...
1812 0.583080
1813 0.766626
1814 0.666135
1815 0.656155
1816 0.698643
Name: Efficiency Rating, Length: 1817, dtype: float64
print(opponent_team_df['Efficiency Rating'].isna().sum())
0
Ho: The player efficiency rating has no effect on a team's wins
Ha: The player efficiency rating has an effect on a team's wins
# Merge merged_team_df and player_df on their shared 'TEAM' column
merged_df = pd.merge(merged_team_df, player_df, on='TEAM')
# Average efficiency and win/loss record per team
average_efficiency = merged_df.groupby('TEAM')[['Player Efficiency Rating', 'W', 'L']].mean()
# Split teams into winning records (more wins than losses) and losing records
winning_teams = average_efficiency.loc[average_efficiency['W'] > average_efficiency['L'], 'Player Efficiency Rating']
losing_teams = average_efficiency.loc[average_efficiency['W'] <= average_efficiency['L'], 'Player Efficiency Rating']
t_stat, p_value = stats.ttest_ind(winning_teams, losing_teams)
alpha = 0.05
if p_value < alpha:
print("We reject the null hypothesis and conclude that the player efficiency rating does have an effect on a team's wins.")
else:
print("We fail to reject the null hypothesis and conclude that the player efficiency rating does not have an effect on a team's wins.")
We fail to reject the null hypothesis and conclude that the player efficiency rating does not have an effect on a team's wins.
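For reference, the same decision rule can be reproduced end to end on hypothetical efficiency samples (the group means and sizes below are invented; the real inputs come from merged_df above):

```python
import numpy as np
from scipy import stats

# Hypothetical per-team average efficiency ratings for winning vs. losing teams
rng = np.random.default_rng(42)
winning = rng.normal(0.70, 0.05, 40)
losing = rng.normal(0.68, 0.05, 40)

t_stat, p_value = stats.ttest_ind(winning, losing)
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(decision, round(p_value, 3))
```

With real data, `stats.ttest_ind` also accepts `equal_var=False` (Welch's t-test), which is safer when the two groups have unequal variances.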
Based on the hypothesis test, we fail to reject the null hypothesis: on its own, the player efficiency rating does not show a statistically significant effect on a team's wins (the p-value exceeds alpha). By contrast, the earlier Ordinary Least Squares (OLS) regression on the Composite Score, with an R-squared of 0.850 and a coefficient of 1.3485, explains a substantial portion of the variance in wins and indicates a positive relationship. Taken together, these results suggest the Composite Score is the more reliable predictor of team wins in NBA games, while the role of player efficiency would require further investigation to confirm.
In the following part of this project we walk through the full process of our data analysis and machine learning model implementation. We start by preprocessing our data: reading it in, counting unique values, converting text to lowercase for consistency, merging datasets, and creating new informative columns. We then calculate information gain and apply one-hot encoding to the categorical columns. Missing values are handled appropriately to ensure the integrity of our dataset. We also perform feature selection, dropping columns that carry little predictive value. After preparing the data we split it into training and testing sets and apply a Random Forest Classifier to make predictions. We evaluate the model using accuracy, a confusion matrix, cross-validation scores, and a classification report. We then visualize feature importance to understand which features drive the model's predictions. Lastly, we introduce regularization to our model to prevent overfitting and improve its generalizability.
It should be noted that we decided not to use the "Player Efficiency Rating": although its p-value from the hypothesis test above was greater than the alpha value, the evidence was not convincing enough to include it in the final model. We therefore work solely with the team dataframe "merged_team_df", since our Composite Score, which has a high information gain value (as you will see below), is built from columns of the team dataframe. It is also worth noting that the R-squared value from the linear regression on the Composite Score is quite high (0.85).
Let's delve into it!
championsdf = pd.read_csv("/content/drive/MyDrive/FINAL PROJECT CSVs/Champions and Year.csv")
<ipython-input-146-4901856a9cf1>:1: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
championsdf = pd.read_csv("/content/drive/MyDrive/FINAL PROJECT CSVs/Champions and Year.csv")
championsdf.head()
| YEAR | CHAMPIONSHIP_TEAM | |
|---|---|---|
| 0 | 2023.0 | DENVER NUGGETS |
| 1 | 2022.0 | GOLDEN STATE WARRIORS |
| 2 | 2021.0 | MILWAUKEE BUCKS |
| 3 | 2020.0 | LOS ANGELES LAKERS |
| 4 | 2019.0 | TORONTO RAPTORS |
# Count the distinct seasons covered by the championship data
num_unique_values = championsdf['YEAR'].nunique()
print(f"The number of unique years in the column is: {num_unique_values}")
The number of unique years in the column is: 77
# Convert the 'team' and 'CHAMPIONSHIP_TEAM' columns to lower case
championsdf['CHAMPIONSHIP_TEAM'] = championsdf['CHAMPIONSHIP_TEAM'].str.lower()
merged_team_df['TEAM'] = merged_team_df['TEAM'].str.lower()
# Merge the dataframes
final_team_df = pd.merge(merged_team_df, championsdf, how='left', left_on=['SEASON', 'TEAM'], right_on=['YEAR', 'CHAMPIONSHIP_TEAM'])
# Create a new column 'champion'
final_team_df['CHAMPION'] = final_team_df['YEAR'].apply(lambda x: 'yes' if pd.notnull(x) else 'no')
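An alternative worth knowing: pandas' merge can tag matches itself via indicator=True, which maps directly onto the yes/no flag without testing 'YEAR' for nulls. A sketch on a tiny hypothetical pair of frames mimicking merged_team_df and championsdf:

```python
import pandas as pd

# Miniature hypothetical versions of merged_team_df and championsdf
teams = pd.DataFrame({"SEASON": [2023, 2023],
                      "TEAM": ["denver nuggets", "miami heat"]})
champs = pd.DataFrame({"YEAR": [2023.0],
                       "CHAMPIONSHIP_TEAM": ["denver nuggets"]})

merged = pd.merge(teams, champs, how="left",
                  left_on=["SEASON", "TEAM"],
                  right_on=["YEAR", "CHAMPIONSHIP_TEAM"],
                  indicator=True)
# '_merge' is 'both' for championship rows and 'left_only' otherwise
merged["CHAMPION"] = (merged["_merge"] == "both").map({True: "yes", False: "no"})
merged = merged.drop(columns="_merge")
print(merged["CHAMPION"].tolist())  # ['yes', 'no']
```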
num_unique_values = final_team_df['SEASON'].nunique()
print(f"The number of seasons values in the dataframe: {num_unique_values}")
The number of seasons values in the dataframe: 78
count_of_value = final_team_df['CHAMPION'].value_counts()['yes']
print(f"The count of 'yes' is: {count_of_value}")
The count of 'yes' is: 75
final_team_df.head(50)
| SEASON | LEAGUE_x | TEAM | ABBREVIATION_x | PLAYOFFS_x | GAMES_x | MINUTES_PLAYED | FIELD_GOALS_MADE | FIELD_GOALS_ATTEMPTED | FIELD_GOALS_PERCENTAGE | ... | TWO_POINTER_PERCENTAGE | DEFENSIVE_REBOUND_PER_GAME | ASSIST_PER_GAME | POINTS_PER_GAME | W | L | Composite Score | YEAR | CHAMPIONSHIP_TEAM | CHAMPION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024 | NBA | atlanta hawks | ATL | False | 55.0 | 13300.0 | 2397.0 | 5144.0 | 0.466 | ... | 0.539 | 31.9 | 26.2 | 121.3 | 24.0 | 31.0 | -7.188061 | NaN | NaN | no |
| 1 | 2024 | NBA | boston celtics | BOS | False | 55.0 | 13325.0 | 2391.0 | 4967.0 | 0.481 | ... | 0.572 | 36.6 | 26.2 | 120.7 | 43.0 | 12.0 | 9.512361 | NaN | NaN | no |
| 2 | 2024 | NBA | brooklyn nets | BRK | False | 54.0 | 13035.0 | 2257.0 | 4902.0 | 0.460 | ... | 0.525 | 32.8 | 26.8 | 113.4 | 21.0 | 33.0 | -9.151929 | NaN | NaN | no |
| 3 | 2024 | NBA | chicago bulls | CHI | False | 55.0 | 13400.0 | 2292.0 | 4904.0 | 0.467 | ... | 0.529 | 32.4 | 24.3 | 111.8 | 26.0 | 29.0 | -6.050632 | NaN | NaN | no |
| 4 | 2024 | NBA | charlotte hornets | CHO | False | 54.0 | 13010.0 | 2206.0 | 4790.0 | 0.461 | ... | 0.519 | 31.0 | 24.9 | 108.6 | 13.0 | 41.0 | -19.029235 | NaN | NaN | no |
| 5 | 2024 | NBA | cleveland cavaliers | CLE | False | 53.0 | 12770.0 | 2267.0 | 4698.0 | 0.483 | ... | 0.570 | 34.2 | 27.3 | 114.9 | 36.0 | 17.0 | 2.834733 | NaN | NaN | no |
| 6 | 2024 | NBA | dallas mavericks | DAL | False | 55.0 | 13200.0 | 2362.0 | 4937.0 | 0.478 | ... | 0.565 | 32.3 | 25.6 | 118.7 | 32.0 | 23.0 | -2.512589 | NaN | NaN | no |
| 7 | 2024 | NBA | denver nuggets | DEN | False | 55.0 | 13200.0 | 2388.0 | 4883.0 | 0.489 | ... | 0.554 | 33.0 | 28.6 | 113.9 | 36.0 | 19.0 | -0.040405 | NaN | NaN | no |
| 8 | 2024 | NBA | detroit pistons | DET | False | 54.0 | 13035.0 | 2281.0 | 4836.0 | 0.472 | ... | 0.531 | 32.4 | 26.5 | 112.9 | 8.0 | 46.0 | -16.458252 | NaN | NaN | no |
| 9 | 2024 | NBA | golden state warriors | GSW | False | 53.0 | 12870.0 | 2311.0 | 4871.0 | 0.474 | ... | 0.546 | 34.3 | 29.0 | 119.7 | 27.0 | 26.0 | -2.811058 | NaN | NaN | no |
| 10 | 2024 | NBA | houston rockets | HOU | False | 54.0 | 13085.0 | 2244.0 | 4861.0 | 0.462 | ... | 0.529 | 34.7 | 24.8 | 113.3 | 24.0 | 30.0 | -6.024857 | NaN | NaN | no |
| 11 | 2024 | NBA | indiana pacers | IND | False | 56.0 | 13440.0 | 2622.0 | 5163.0 | 0.508 | ... | 0.588 | 30.6 | 30.9 | 123.7 | 31.0 | 25.0 | 0.622463 | NaN | NaN | no |
| 12 | 2024 | NBA | los angeles clippers | LAC | False | 53.0 | 12745.0 | 2281.0 | 4622.0 | 0.494 | ... | 0.555 | 32.8 | 26.0 | 118.3 | 36.0 | 17.0 | 3.685619 | NaN | NaN | no |
| 13 | 2024 | NBA | los angeles lakers | LAL | False | 56.0 | 13540.0 | 2421.0 | 4900.0 | 0.494 | ... | 0.561 | 34.5 | 28.5 | 117.0 | 30.0 | 26.0 | -1.645627 | NaN | NaN | no |
| 14 | 2024 | NBA | memphis grizzlies | MEM | False | 56.0 | 13490.0 | 2174.0 | 4929.0 | 0.441 | ... | 0.514 | 31.8 | 25.4 | 107.1 | 20.0 | 36.0 | -15.122968 | NaN | NaN | no |
| 15 | 2024 | NBA | miami heat | MIA | False | 55.0 | 13225.0 | 2183.0 | 4719.0 | 0.463 | ... | 0.521 | 32.6 | 25.9 | 110.6 | 30.0 | 25.0 | -4.973745 | NaN | NaN | no |
| 16 | 2024 | NBA | milwaukee bucks | MIL | False | 56.0 | 13515.0 | 2468.0 | 4998.0 | 0.494 | ... | 0.582 | 34.7 | 26.7 | 122.0 | 35.0 | 21.0 | 3.855911 | NaN | NaN | no |
| 17 | 2024 | NBA | minnesota timberwolves | MIN | False | 55.0 | 13275.0 | 2275.0 | 4639.0 | 0.490 | ... | 0.550 | 34.7 | 26.4 | 114.0 | 39.0 | 16.0 | 4.795969 | NaN | NaN | no |
| 18 | 2024 | NBA | new orleans pelicans | NOP | False | 55.0 | 13225.0 | 2360.0 | 4829.0 | 0.489 | ... | 0.553 | 33.3 | 27.1 | 116.5 | 33.0 | 22.0 | 0.996059 | NaN | NaN | no |
| 19 | 2024 | NBA | new york knicks | NYK | False | 55.0 | 13200.0 | 2292.0 | 4906.0 | 0.467 | ... | 0.532 | 33.1 | 23.7 | 114.5 | 33.0 | 22.0 | -0.595368 | NaN | NaN | no |
| 20 | 2024 | NBA | oklahoma city thunder | OKC | False | 54.0 | 13060.0 | 2405.0 | 4808.0 | 0.500 | ... | 0.566 | 32.5 | 27.3 | 120.8 | 37.0 | 17.0 | 4.547477 | NaN | NaN | no |
| 21 | 2024 | NBA | orlando magic | ORL | False | 55.0 | 13300.0 | 2240.0 | 4734.0 | 0.473 | ... | 0.544 | 31.9 | 24.9 | 111.7 | 30.0 | 25.0 | -5.589931 | NaN | NaN | no |
| 22 | 2024 | NBA | philadelphia 76ers | PHI | False | 54.0 | 13010.0 | 2296.0 | 4838.0 | 0.475 | ... | 0.537 | 32.1 | 24.9 | 118.3 | 32.0 | 22.0 | 1.385087 | NaN | NaN | no |
| 23 | 2024 | NBA | phoenix suns | PHO | False | 55.0 | 13275.0 | 2347.0 | 4707.0 | 0.499 | ... | 0.568 | 33.5 | 26.8 | 117.6 | 33.0 | 22.0 | 3.284171 | NaN | NaN | no |
| 24 | 2024 | NBA | portland trail blazers | POR | False | 54.0 | 13110.0 | 2119.0 | 4820.0 | 0.440 | ... | 0.491 | 29.5 | 22.6 | 107.9 | 15.0 | 39.0 | -20.070180 | NaN | NaN | no |
| 25 | 2024 | NBA | sacramento kings | SAC | False | 54.0 | 13085.0 | 2368.0 | 4908.0 | 0.482 | ... | 0.571 | 33.0 | 28.6 | 118.4 | 31.0 | 23.0 | -3.691675 | NaN | NaN | no |
| 26 | 2024 | NBA | san antonio spurs | SAS | False | 55.0 | 13250.0 | 2302.0 | 5012.0 | 0.459 | ... | 0.539 | 33.0 | 29.3 | 111.8 | 11.0 | 44.0 | -15.600723 | NaN | NaN | no |
| 27 | 2024 | NBA | toronto raptors | TOR | False | 55.0 | 13300.0 | 2354.0 | 4943.0 | 0.476 | ... | 0.545 | 32.2 | 29.2 | 113.9 | 19.0 | 36.0 | -8.751400 | NaN | NaN | no |
| 28 | 2024 | NBA | utah jazz | UTA | False | 56.0 | 13565.0 | 2398.0 | 5091.0 | 0.471 | ... | 0.549 | 33.8 | 28.4 | 118.0 | 26.0 | 30.0 | -6.301609 | NaN | NaN | no |
| 29 | 2024 | NBA | washington wizards | WAS | False | 54.0 | 12960.0 | 2338.0 | 4930.0 | 0.474 | ... | 0.550 | 31.5 | 28.1 | 114.8 | 9.0 | 45.0 | -14.708072 | NaN | NaN | no |
| 30 | 2023 | NBA | atlanta hawks | ATL | False | 82.0 | 19855.0 | 3658.0 | 7574.0 | 0.483 | ... | 0.548 | 33.2 | 25.0 | 118.4 | 41.0 | 41.0 | 5.039652 | NaN | NaN | no |
| 31 | 2023 | NBA | boston celtics | BOS | False | 82.0 | 19980.0 | 3460.0 | 7278.0 | 0.475 | ... | 0.567 | 35.6 | 26.7 | 117.9 | 57.0 | 25.0 | 14.503575 | NaN | NaN | no |
| 32 | 2023 | NBA | brooklyn nets | BRK | False | 82.0 | 19730.0 | 3399.0 | 6978.0 | 0.487 | ... | 0.559 | 32.3 | 25.5 | 113.4 | 45.0 | 37.0 | 6.229818 | NaN | NaN | no |
| 33 | 2023 | NBA | chicago bulls | CHI | False | 82.0 | 19905.0 | 3488.0 | 7116.0 | 0.490 | ... | 0.555 | 33.9 | 24.5 | 113.1 | 40.0 | 42.0 | 6.265528 | NaN | NaN | no |
| 34 | 2023 | NBA | charlotte hornets | CHO | False | 82.0 | 19830.0 | 3385.0 | 7413.0 | 0.457 | ... | 0.528 | 33.5 | 25.1 | 111.0 | 27.0 | 55.0 | -5.664596 | NaN | NaN | no |
| 35 | 2023 | NBA | cleveland cavaliers | CLE | False | 82.0 | 19880.0 | 3408.0 | 6984.0 | 0.488 | ... | 0.559 | 31.4 | 24.9 | 112.3 | 51.0 | 31.0 | 10.240028 | NaN | NaN | no |
| 36 | 2023 | NBA | dallas mavericks | DAL | False | 82.0 | 19930.0 | 3283.0 | 6909.0 | 0.475 | ... | 0.574 | 31.2 | 22.9 | 114.2 | 38.0 | 44.0 | 3.161333 | NaN | NaN | no |
| 37 | 2023 | NBA | denver nuggets | DEN | False | 82.0 | 19755.0 | 3574.0 | 7088.0 | 0.504 | ... | 0.575 | 32.9 | 28.9 | 115.8 | 53.0 | 29.0 | 11.046367 | 2023.0 | denver nuggets | yes |
| 38 | 2023 | NBA | detroit pistons | DET | False | 82.0 | 19805.0 | 3244.0 | 7140.0 | 0.454 | ... | 0.516 | 31.3 | 23.0 | 110.3 | 17.0 | 65.0 | -10.766249 | NaN | NaN | no |
| 39 | 2023 | NBA | golden state warriors | GSW | False | 82.0 | 19830.0 | 3538.0 | 7393.0 | 0.479 | ... | 0.564 | 34.1 | 29.8 | 118.9 | 44.0 | 38.0 | 8.250557 | NaN | NaN | no |
| 40 | 2023 | NBA | houston rockets | HOU | False | 82.0 | 19755.0 | 3329.0 | 7286.0 | 0.457 | ... | 0.530 | 32.9 | 22.4 | 110.7 | 22.0 | 60.0 | -9.554702 | NaN | NaN | no |
| 41 | 2023 | NBA | indiana pacers | IND | False | 82.0 | 19755.0 | 3444.0 | 7345.0 | 0.469 | ... | 0.540 | 31.4 | 27.0 | 116.3 | 35.0 | 47.0 | -0.748703 | NaN | NaN | no |
| 42 | 2023 | NBA | los angeles clippers | LAC | False | 82.0 | 19830.0 | 3370.0 | 7059.0 | 0.477 | ... | 0.539 | 33.4 | 23.9 | 113.6 | 44.0 | 38.0 | 5.304915 | NaN | NaN | no |
| 43 | 2023 | NBA | los angeles lakers | LAL | False | 82.0 | 19880.0 | 3516.0 | 7298.0 | 0.482 | ... | 0.555 | 35.7 | 25.3 | 117.2 | 43.0 | 39.0 | 7.100823 | NaN | NaN | no |
| 44 | 2023 | NBA | memphis grizzlies | MEM | False | 82.0 | 19780.0 | 3585.0 | 7551.0 | 0.475 | ... | 0.548 | 34.6 | 26.0 | 116.9 | 51.0 | 31.0 | 9.904935 | NaN | NaN | no |
| 45 | 2023 | NBA | miami heat | MIA | False | 82.0 | 19805.0 | 3215.0 | 6991.0 | 0.460 | ... | 0.540 | 30.9 | 23.8 | 109.5 | 44.0 | 38.0 | 1.289634 | NaN | NaN | no |
| 46 | 2023 | NBA | milwaukee bucks | MIL | False | 82.0 | 19830.0 | 3504.0 | 7411.0 | 0.473 | ... | 0.557 | 37.5 | 25.8 | 116.9 | 58.0 | 24.0 | 11.618075 | NaN | NaN | no |
| 47 | 2023 | NBA | minnesota timberwolves | MIN | False | 82.0 | 19830.0 | 3515.0 | 7167.0 | 0.490 | ... | 0.568 | 32.8 | 26.2 | 115.8 | 42.0 | 40.0 | 5.278023 | NaN | NaN | no |
| 48 | 2023 | NBA | new orleans pelicans | NOP | False | 82.0 | 19855.0 | 3447.0 | 7180.0 | 0.480 | ... | 0.541 | 33.1 | 25.9 | 114.4 | 42.0 | 40.0 | 6.503432 | NaN | NaN | no |
| 49 | 2023 | NBA | new york knicks | NYK | False | 82.0 | 19955.0 | 3444.0 | 7328.0 | 0.470 | ... | 0.547 | 34.0 | 22.9 | 116.0 | 47.0 | 35.0 | 6.553436 | NaN | NaN | no |
50 rows × 53 columns
final_team_df.columns.tolist()
['SEASON', 'LEAGUE_x', 'TEAM', 'ABBREVIATION_x', 'PLAYOFFS_x', 'GAMES_x', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE', 'FIELD_GOALS_ATTEMPTED', 'FIELD_GOALS_PERCENTAGE', 'TWO_POINTERS_MADE', 'TWO_POINTERS_PERCENTAGE', 'FREE_THROWS_MADE', 'FREE_THROWS_ATTEMPTED', 'FREE_THROW_PERCENTAGE_x', 'DEFENSIVE_REBOUND', 'TOTAL_REBOUNDS', 'ASSISTS', 'STEALS', 'BLOCKS', 'TURNOVERS', 'POINTS', 'L_x', 'PLAYOFFS_y', 'AGE', 'PW', 'PL', 'MARGIN_OF_VICTORY', 'STRENGTH_OF_SCHEDULE', 'SIMPLE_RATING_SYSTEM', 'OFFENSIVE_RATING', 'DEFENSIVE_RATING', 'NET_RATING', 'TRUE_SHOOTING_PERCENTAGE', 'EFFECTIEV_FIELD_GOAL_PERCENTAGE', 'TURNOVER_PERCENTAGE', 'OPPONENT_EFFECTIVE_FIELD_GAL', 'OPPONENT_FREE_THROW_PERCENTAGE.1', 'LEAGUE', 'ABBREVIATION', 'PLAYOFFS', 'fg_per_game', 'FIELD_GOAL_PERCENTAGE', 'TWO_POINTER_PERCENTAGE', 'DEFENSIVE_REBOUND_PER_GAME', 'ASSIST_PER_GAME', 'POINTS_PER_GAME', 'W', 'L', 'Composite Score', 'YEAR', 'CHAMPIONSHIP_TEAM', 'CHAMPION']
Dropping some columns that might not be relevant
final_team_df.drop(columns=['LEAGUE', 'LEAGUE_x', 'ABBREVIATION', 'ABBREVIATION_x', 'PLAYOFFS_x', 'PLAYOFFS_y'], inplace=True)
final_team_df.dtypes
SEASON int64
TEAM object
GAMES_x float64
MINUTES_PLAYED float64
FIELD_GOALS_MADE float64
FIELD_GOALS_ATTEMPTED float64
FIELD_GOALS_PERCENTAGE float64
TWO_POINTERS_MADE float64
TWO_POINTERS_PERCENTAGE float64
FREE_THROWS_MADE float64
FREE_THROWS_ATTEMPTED float64
FREE_THROW_PERCENTAGE_x float64
DEFENSIVE_REBOUND float64
TOTAL_REBOUNDS float64
ASSISTS float64
STEALS float64
BLOCKS float64
TURNOVERS float64
POINTS float64
L_x float64
AGE float64
PW float64
PL float64
MARGIN_OF_VICTORY float64
STRENGTH_OF_SCHEDULE float64
SIMPLE_RATING_SYSTEM float64
OFFENSIVE_RATING float64
DEFENSIVE_RATING float64
NET_RATING float64
TRUE_SHOOTING_PERCENTAGE float64
EFFECTIEV_FIELD_GOAL_PERCENTAGE float64
TURNOVER_PERCENTAGE float64
OPPONENT_EFFECTIVE_FIELD_GAL float64
OPPONENT_FREE_THROW_PERCENTAGE.1 float64
PLAYOFFS object
fg_per_game float64
FIELD_GOAL_PERCENTAGE float64
TWO_POINTER_PERCENTAGE float64
DEFENSIVE_REBOUND_PER_GAME float64
ASSIST_PER_GAME float64
POINTS_PER_GAME float64
W float64
L float64
Composite Score float64
YEAR float64
CHAMPIONSHIP_TEAM object
CHAMPION object
dtype: object
We will be applying One Hot Encoding to categorical data to make sure our dataframe is mainly numeric.
encoded_final_team_df = pd.get_dummies(final_team_df, columns=['PLAYOFFS', 'CHAMPION'], drop_first= True)
encoded_final_team_df.rename(columns={'PLAYOFFS_True': 'PLAYOFFS', 'CHAMPION_yes' : 'CHAMPION'}, inplace=True)
# Convert boolean columns to integer (0 and 1)
encoded_final_team_df['PLAYOFFS'] = encoded_final_team_df['PLAYOFFS'].astype(int)
encoded_final_team_df['CHAMPION'] = encoded_final_team_df['CHAMPION'].astype(int)
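On a tiny hypothetical frame, the effect of drop_first=True is easy to see: the first category ('no') becomes the implicit baseline and only a single 0/1 indicator column remains:

```python
import pandas as pd

# Hypothetical miniature of the CHAMPION column alongside a numeric column
demo = pd.DataFrame({"CHAMPION": ["no", "yes", "no"], "W": [20, 53, 17]})
encoded = pd.get_dummies(demo, columns=["CHAMPION"], drop_first=True)
# Only CHAMPION_yes survives; 'no' is implied by CHAMPION_yes == 0
encoded["CHAMPION_yes"] = encoded["CHAMPION_yes"].astype(int)
print(encoded["CHAMPION_yes"].tolist())  # [0, 1, 0]
```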
encoded_final_team_df
| SEASON | TEAM | GAMES_x | MINUTES_PLAYED | FIELD_GOALS_MADE | FIELD_GOALS_ATTEMPTED | FIELD_GOALS_PERCENTAGE | TWO_POINTERS_MADE | TWO_POINTERS_PERCENTAGE | FREE_THROWS_MADE | ... | DEFENSIVE_REBOUND_PER_GAME | ASSIST_PER_GAME | POINTS_PER_GAME | W | L | Composite Score | YEAR | CHAMPIONSHIP_TEAM | PLAYOFFS | CHAMPION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024 | atlanta hawks | 55.0 | 13300.000000 | 2397.0 | 5144.0 | 0.466 | 1641.0 | 0.539 | 1123.0 | ... | 31.900000 | 26.2 | 121.3 | 24.0 | 31.0 | -7.188061 | NaN | NaN | 0 | 0 |
| 1 | 2024 | boston celtics | 55.0 | 13325.000000 | 2391.0 | 4967.0 | 0.481 | 1498.0 | 0.572 | 962.0 | ... | 36.600000 | 26.2 | 120.7 | 43.0 | 12.0 | 9.512361 | NaN | NaN | 0 | 0 |
| 2 | 2024 | brooklyn nets | 54.0 | 13035.000000 | 2257.0 | 4902.0 | 0.460 | 1510.0 | 0.525 | 860.0 | ... | 32.800000 | 26.8 | 113.4 | 21.0 | 33.0 | -9.151929 | NaN | NaN | 0 | 0 |
| 3 | 2024 | chicago bulls | 55.0 | 13400.000000 | 2292.0 | 4904.0 | 0.467 | 1644.0 | 0.529 | 917.0 | ... | 32.400000 | 24.3 | 111.8 | 26.0 | 29.0 | -6.050632 | NaN | NaN | 0 | 0 |
| 4 | 2024 | charlotte hornets | 54.0 | 13010.000000 | 2206.0 | 4790.0 | 0.461 | 1559.0 | 0.519 | 808.0 | ... | 31.000000 | 24.9 | 108.6 | 13.0 | 41.0 | -19.029235 | NaN | NaN | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1600 | 1948 | new york knicks | 48.0 | 19400.166667 | 1355.0 | 4724.0 | 0.287 | 1355.0 | 0.287 | 868.0 | ... | 30.509804 | 7.8 | 74.5 | 26.0 | 22.0 | -21.047678 | NaN | NaN | 1 | 0 |
| 1601 | 1948 | washington capitols | 48.0 | 20235.000000 | 1336.0 | 4785.0 | 0.279 | 1336.0 | 0.279 | 865.0 | ... | 36.900000 | 6.4 | 73.7 | 28.0 | 20.0 | -23.873765 | NaN | NaN | 1 | 0 |
| 1602 | 1947 | boston celtics | 60.0 | 19403.083333 | 1397.0 | 5133.0 | 0.272 | 1397.0 | 0.272 | 811.0 | ... | 31.650980 | 7.8 | 60.1 | 22.0 | 38.0 | -29.726828 | NaN | NaN | 0 | 0 |
| 1603 | 1947 | new york knicks | 60.0 | 19400.166667 | 1465.0 | 5255.0 | 0.279 | 1465.0 | 0.279 | 951.0 | ... | 30.509804 | 7.6 | 64.7 | 33.0 | 27.0 | -23.842757 | NaN | NaN | 1 | 0 |
| 1604 | 1947 | washington capitols | 60.0 | 20235.000000 | 1723.0 | 5794.0 | 0.297 | 1723.0 | 0.297 | 982.0 | ... | 36.900000 | 6.3 | 73.8 | 49.0 | 11.0 | -13.805353 | NaN | NaN | 1 | 0 |
1605 rows × 47 columns
We removed all rows for the 2024 season, since 2024 is the year we will be predicting at the end of this project.
encoded_final_team_df = encoded_final_team_df[encoded_final_team_df['SEASON'] != 2024]
encoded_final_team_df.columns.tolist()
['SEASON', 'TEAM', 'GAMES_x', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE', 'FIELD_GOALS_ATTEMPTED', 'FIELD_GOALS_PERCENTAGE', 'TWO_POINTERS_MADE', 'TWO_POINTERS_PERCENTAGE', 'FREE_THROWS_MADE', 'FREE_THROWS_ATTEMPTED', 'FREE_THROW_PERCENTAGE_x', 'DEFENSIVE_REBOUND', 'TOTAL_REBOUNDS', 'ASSISTS', 'STEALS', 'BLOCKS', 'TURNOVERS', 'POINTS', 'L_x', 'AGE', 'PW', 'PL', 'MARGIN_OF_VICTORY', 'STRENGTH_OF_SCHEDULE', 'SIMPLE_RATING_SYSTEM', 'OFFENSIVE_RATING', 'DEFENSIVE_RATING', 'NET_RATING', 'TRUE_SHOOTING_PERCENTAGE', 'EFFECTIEV_FIELD_GOAL_PERCENTAGE', 'TURNOVER_PERCENTAGE', 'OPPONENT_EFFECTIVE_FIELD_GAL', 'OPPONENT_FREE_THROW_PERCENTAGE.1', 'fg_per_game', 'FIELD_GOAL_PERCENTAGE', 'TWO_POINTER_PERCENTAGE', 'DEFENSIVE_REBOUND_PER_GAME', 'ASSIST_PER_GAME', 'POINTS_PER_GAME', 'W', 'L', 'Composite Score', 'YEAR', 'CHAMPIONSHIP_TEAM', 'PLAYOFFS', 'CHAMPION']
Calculating information gain to find which features are most strongly related to a team winning the championship.
import math  # used by the entropy helper below

def information_gain(data, feature, target):
# Calculate the entropy of the data before the split
entropy_before = entropy(data[target])
# Calculate the entropy of the data after the split
grouped_entropy = data.groupby(feature)[target].apply(entropy)
entropy_after = (data.groupby(feature).size() / len(data)) @ grouped_entropy
# Calculate the information gain
information_gain = entropy_before - entropy_after
return information_gain
def entropy(data):
# Calculate the probability of each class
probabilities = data.value_counts(normalize=True)
# Calculate the entropy
entropy = - (probabilities * probabilities.apply(math.log2)).sum()
return entropy
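A quick sanity check of the entropy helper (restated here so the snippet runs on its own): a perfect 50/50 split of the target yields exactly 1 bit, while a single-class column yields 0 bits.

```python
import math
import pandas as pd

def entropy(data):
    """Shannon entropy in bits, mirroring the helper above."""
    probabilities = data.value_counts(normalize=True)
    return -(probabilities * probabilities.apply(math.log2)).sum()

balanced = pd.Series(["yes", "no", "yes", "no"])  # 50/50 split -> 1 bit
pure = pd.Series(["no", "no", "no", "no"])        # one class -> 0 bits
print(entropy(balanced), entropy(pure))
```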
# Calculate the information gain for each column against 'CHAMPION'
target_feature = 'CHAMPION'
information_gains = {}
for column in encoded_final_team_df.columns:
if column != target_feature: # Skip the target feature itself
information_gains[column] = information_gain(encoded_final_team_df, column, target_feature)
# Print the information gains
for column, gain in information_gains.items():
print(f"Information gain for '{column}' against '{target_feature}': {gain}")
Information gain for 'SEASON' against 'CHAMPION': 0.014950515682800036
Information gain for 'TEAM' against 'CHAMPION': 0.05767933757707558
Information gain for 'GAMES_x' against 'CHAMPION': 0.011448743875787482
Information gain for 'MINUTES_PLAYED' against 'CHAMPION': 0.047207293700936154
Information gain for 'FIELD_GOALS_MADE' against 'CHAMPION': 0.20752560039814788
Information gain for 'FIELD_GOALS_ATTEMPTED' against 'CHAMPION': 0.23226885809553627
Information gain for 'FIELD_GOALS_PERCENTAGE' against 'CHAMPION': 0.08268211670982178
Information gain for 'TWO_POINTERS_MADE' against 'CHAMPION': 0.2212974000627263
Information gain for 'TWO_POINTERS_PERCENTAGE' against 'CHAMPION': 0.08535481706440465
Information gain for 'FREE_THROWS_MADE' against 'CHAMPION': 0.18479791598779124
Information gain for 'FREE_THROWS_ATTEMPTED' against 'CHAMPION': 0.20856399533363978
Information gain for 'FREE_THROW_PERCENTAGE_x' against 'CHAMPION': 0.05697120368231767
Information gain for 'DEFENSIVE_REBOUND' against 'CHAMPION': 0.16205761003546354
Information gain for 'TOTAL_REBOUNDS' against 'CHAMPION': 0.21047493558169977
Information gain for 'ASSISTS' against 'CHAMPION': 0.18527630668706338
Information gain for 'STEALS' against 'CHAMPION': 0.13934322636949498
Information gain for 'BLOCKS' against 'CHAMPION': 0.12884232380171226
Information gain for 'TURNOVERS' against 'CHAMPION': 0.16491837011046245
Information gain for 'POINTS' against 'CHAMPION': 0.22373244956343707
Information gain for 'L_x' against 'CHAMPION': 0.10928455726939384
Information gain for 'AGE' against 'CHAMPION': 0.06424082812056756
Information gain for 'PW' against 'CHAMPION': 0.0962447338650389
Information gain for 'PL' against 'CHAMPION': 0.09290193883833536
Information gain for 'MARGIN_OF_VICTORY' against 'CHAMPION': 0.2476115863726033
Information gain for 'STRENGTH_OF_SCHEDULE' against 'CHAMPION': 0.09360754023198553
Information gain for 'SIMPLE_RATING_SYSTEM' against 'CHAMPION': 0.22808467367281696
Information gain for 'OFFENSIVE_RATING' against 'CHAMPION': 0.11167737066739572
Information gain for 'DEFENSIVE_RATING' against 'CHAMPION': 0.13150596965091188
Information gain for 'NET_RATING' against 'CHAMPION': 0.13423877868743925
Information gain for 'TRUE_SHOOTING_PERCENTAGE' against 'CHAMPION': 0.07878275069041518
Information gain for 'EFFECTIEV_FIELD_GOAL_PERCENTAGE' against 'CHAMPION': 0.08628835913075475
Information gain for 'TURNOVER_PERCENTAGE' against 'CHAMPION': 0.049494424878997134
Information gain for 'OPPONENT_EFFECTIVE_FIELD_GAL' against 'CHAMPION': 0.09005161487932403
Information gain for 'OPPONENT_FREE_THROW_PERCENTAGE.1' against 'CHAMPION': 0.07570697757327138
Information gain for 'fg_per_game' against 'CHAMPION': 0.09389155654177977
Information gain for 'FIELD_GOAL_PERCENTAGE' against 'CHAMPION': 0.08268211670982178
Information gain for 'TWO_POINTER_PERCENTAGE' against 'CHAMPION': 0.08535481706440465
Information gain for 'DEFENSIVE_REBOUND_PER_GAME' against 'CHAMPION': 0.07873905171513682
Information gain for 'ASSIST_PER_GAME' against 'CHAMPION': 0.07003492190767108
Information gain for 'POINTS_PER_GAME' against 'CHAMPION': 0.11811261893620706
Information gain for 'W' against 'CHAMPION': 0.11243597237104624
Information gain for 'L' against 'CHAMPION': 0.10928455726939384
Information gain for 'Composite Score' against 'CHAMPION': 0.2761954276479391
Information gain for 'YEAR' against 'CHAMPION': 0.2761954276479391
Information gain for 'CHAMPIONSHIP_TEAM' against 'CHAMPION': 0.2761954276479391
Information gain for 'PLAYOFFS' against 'CHAMPION': 0.03535872083113947
By scaling the information gain values, we mitigate the potential dominance of features with larger scales, allowing for a more balanced evaluation of each feature's importance in predicting championship outcomes. This normalization process enhances the robustness and interpretability of our model, enabling us to make informed decisions about which features to include in our regression analysis for more reliable championship predictions.
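The computation and normalization described above can be sketched along these lines. This is a minimal illustration, not the notebook's exact code: `mutual_info_classif` is scikit-learn's estimator of mutual information (the quantity usually called information gain), and `X_demo`/`y_demo` are toy stand-ins for our team statistics and the CHAMPION label.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Toy stand-in for our encoded team dataframe and the CHAMPION target
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Information gain (mutual information) of each feature against the label
gains = mutual_info_classif(X_demo, y_demo, random_state=0)

# Min-max scale the gains to [0, 1] so comparisons between features are
# not dominated by raw magnitude
scaled_gains = (gains - gains.min()) / (gains.max() - gains.min())
print(scaled_gains)
```

The scaled values make it easy to rank features and pick a cutoff for which columns to keep.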
encoded_final_team_df.head(50)
| | SEASON | TEAM | GAMES_x | MINUTES_PLAYED | FIELD_GOALS_MADE | FIELD_GOALS_ATTEMPTED | FIELD_GOALS_PERCENTAGE | TWO_POINTERS_MADE | TWO_POINTERS_PERCENTAGE | FREE_THROWS_MADE | ... | DEFENSIVE_REBOUND_PER_GAME | ASSIST_PER_GAME | POINTS_PER_GAME | W | L | Composite Score | YEAR | CHAMPIONSHIP_TEAM | PLAYOFFS | CHAMPION |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30 | 2023 | atlanta hawks | 82.0 | 19855.0 | 3658.0 | 7574.0 | 0.483 | 2776.0 | 0.548 | 1513.0 | ... | 33.2 | 25.0 | 118.4 | 41.0 | 41.0 | 5.039652 | NaN | NaN | 0 | 0 |
| 31 | 2023 | boston celtics | 82.0 | 19980.0 | 3460.0 | 7278.0 | 0.475 | 2145.0 | 0.567 | 1436.0 | ... | 35.6 | 26.7 | 117.9 | 57.0 | 25.0 | 14.503575 | NaN | NaN | 0 | 0 |
| 32 | 2023 | brooklyn nets | 82.0 | 19730.0 | 3399.0 | 6978.0 | 0.487 | 2351.0 | 0.559 | 1449.0 | ... | 32.3 | 25.5 | 113.4 | 45.0 | 37.0 | 6.229818 | NaN | NaN | 0 | 0 |
| 33 | 2023 | chicago bulls | 82.0 | 19905.0 | 3488.0 | 7116.0 | 0.490 | 2634.0 | 0.555 | 1446.0 | ... | 33.9 | 24.5 | 113.1 | 40.0 | 42.0 | 6.265528 | NaN | NaN | 0 | 0 |
| 34 | 2023 | charlotte hornets | 82.0 | 19830.0 | 3385.0 | 7413.0 | 0.457 | 2504.0 | 0.528 | 1447.0 | ... | 33.5 | 25.1 | 111.0 | 27.0 | 55.0 | -5.664596 | NaN | NaN | 0 | 0 |
| 35 | 2023 | cleveland cavaliers | 82.0 | 19880.0 | 3408.0 | 6984.0 | 0.488 | 2458.0 | 0.559 | 1439.0 | ... | 31.4 | 24.9 | 112.3 | 51.0 | 31.0 | 10.240028 | NaN | NaN | 0 | 0 |
| 36 | 2023 | dallas mavericks | 82.0 | 19930.0 | 3283.0 | 6909.0 | 0.475 | 2037.0 | 0.574 | 1554.0 | ... | 31.2 | 22.9 | 114.2 | 38.0 | 44.0 | 3.161333 | NaN | NaN | 0 | 0 |
| 37 | 2023 | denver nuggets | 82.0 | 19755.0 | 3574.0 | 7088.0 | 0.504 | 2605.0 | 0.575 | 1378.0 | ... | 32.9 | 28.9 | 115.8 | 53.0 | 29.0 | 11.046367 | 2023.0 | denver nuggets | 0 | 1 |
| 38 | 2023 | detroit pistons | 82.0 | 19805.0 | 3244.0 | 7140.0 | 0.454 | 2310.0 | 0.516 | 1623.0 | ... | 31.3 | 23.0 | 110.3 | 17.0 | 65.0 | -10.766249 | NaN | NaN | 0 | 0 |
| 39 | 2023 | golden state warriors | 82.0 | 19830.0 | 3538.0 | 7393.0 | 0.479 | 2175.0 | 0.564 | 1314.0 | ... | 34.1 | 29.8 | 118.9 | 44.0 | 38.0 | 8.250557 | NaN | NaN | 0 | 0 |
| 40 | 2023 | houston rockets | 82.0 | 19755.0 | 3329.0 | 7286.0 | 0.457 | 2473.0 | 0.530 | 1567.0 | ... | 32.9 | 22.4 | 110.7 | 22.0 | 60.0 | -9.554702 | NaN | NaN | 0 | 0 |
| 41 | 2023 | indiana pacers | 82.0 | 19755.0 | 3444.0 | 7345.0 | 0.469 | 2332.0 | 0.540 | 1535.0 | ... | 31.4 | 27.0 | 116.3 | 35.0 | 47.0 | -0.748703 | NaN | NaN | 0 | 0 |
| 42 | 2023 | los angeles clippers | 82.0 | 19830.0 | 3370.0 | 7059.0 | 0.477 | 2329.0 | 0.539 | 1533.0 | ... | 33.4 | 23.9 | 113.6 | 44.0 | 38.0 | 5.304915 | NaN | NaN | 0 | 0 |
| 43 | 2023 | los angeles lakers | 82.0 | 19880.0 | 3516.0 | 7298.0 | 0.482 | 2631.0 | 0.555 | 1691.0 | ... | 35.7 | 25.3 | 117.2 | 43.0 | 39.0 | 7.100823 | NaN | NaN | 0 | 0 |
| 44 | 2023 | memphis grizzlies | 82.0 | 19780.0 | 3585.0 | 7551.0 | 0.475 | 2600.0 | 0.548 | 1432.0 | ... | 34.6 | 26.0 | 116.9 | 51.0 | 31.0 | 9.904935 | NaN | NaN | 0 | 0 |
| 45 | 2023 | miami heat | 82.0 | 19805.0 | 3215.0 | 6991.0 | 0.460 | 2235.0 | 0.540 | 1567.0 | ... | 30.9 | 23.8 | 109.5 | 44.0 | 38.0 | 1.289634 | NaN | NaN | 0 | 0 |
| 46 | 2023 | milwaukee bucks | 82.0 | 19830.0 | 3504.0 | 7411.0 | 0.473 | 2287.0 | 0.557 | 1364.0 | ... | 37.5 | 25.8 | 116.9 | 58.0 | 24.0 | 11.618075 | NaN | NaN | 0 | 0 |
| 47 | 2023 | minnesota timberwolves | 82.0 | 19830.0 | 3515.0 | 7167.0 | 0.490 | 2518.0 | 0.568 | 1467.0 | ... | 32.8 | 26.2 | 115.8 | 42.0 | 40.0 | 5.278023 | NaN | NaN | 0 | 0 |
| 48 | 2023 | new orleans pelicans | 82.0 | 19855.0 | 3447.0 | 7180.0 | 0.480 | 2548.0 | 0.541 | 1585.0 | ... | 33.1 | 25.9 | 114.4 | 42.0 | 40.0 | 6.503432 | NaN | NaN | 0 | 0 |
| 49 | 2023 | new york knicks | 82.0 | 19955.0 | 3444.0 | 7328.0 | 0.470 | 2407.0 | 0.547 | 1589.0 | ... | 34.0 | 22.9 | 116.0 | 47.0 | 35.0 | 6.553436 | NaN | NaN | 0 | 0 |
| 50 | 2023 | oklahoma city thunder | 82.0 | 19855.0 | 3533.0 | 7590.0 | 0.465 | 2538.0 | 0.530 | 1572.0 | ... | 32.3 | 24.4 | 117.5 | 40.0 | 42.0 | 3.638798 | NaN | NaN | 0 | 0 |
| 51 | 2023 | orlando magic | 82.0 | 19780.0 | 3323.0 | 7074.0 | 0.470 | 2440.0 | 0.539 | 1607.0 | ... | 33.1 | 23.2 | 111.4 | 34.0 | 48.0 | -1.575875 | NaN | NaN | 0 | 0 |
| 52 | 2023 | philadelphia 76ers | 82.0 | 19880.0 | 3347.0 | 6870.0 | 0.487 | 2312.0 | 0.551 | 1719.0 | ... | 32.2 | 25.2 | 115.2 | 54.0 | 28.0 | 10.753173 | NaN | NaN | 0 | 0 |
| 53 | 2023 | phoenix suns | 82.0 | 19780.0 | 3453.0 | 7388.0 | 0.467 | 2452.0 | 0.520 | 1412.0 | ... | 32.4 | 27.3 | 113.6 | 45.0 | 37.0 | 6.281789 | NaN | NaN | 0 | 0 |
| 54 | 2023 | portland trail blazers | 82.0 | 19730.0 | 3317.0 | 7001.0 | 0.474 | 2261.0 | 0.550 | 1609.0 | ... | 31.1 | 24.2 | 113.4 | 33.0 | 49.0 | -2.474210 | NaN | NaN | 0 | 0 |
| 55 | 2023 | sacramento kings | 82.0 | 19830.0 | 3573.0 | 7232.0 | 0.494 | 2445.0 | 0.586 | 1624.0 | ... | 32.9 | 27.3 | 120.7 | 48.0 | 34.0 | 9.562584 | NaN | NaN | 0 | 0 |
| 56 | 2023 | san antonio spurs | 82.0 | 19855.0 | 3533.0 | 7593.0 | 0.465 | 2622.0 | 0.529 | 1292.0 | ... | 31.9 | 27.2 | 113.0 | 22.0 | 60.0 | -10.049878 | NaN | NaN | 0 | 0 |
| 57 | 2023 | toronto raptors | 82.0 | 19805.0 | 3434.0 | 7489.0 | 0.459 | 2554.0 | 0.525 | 1506.0 | ... | 30.3 | 23.9 | 112.9 | 41.0 | 41.0 | 3.093160 | NaN | NaN | 0 | 0 |
| 58 | 2023 | utah jazz | 82.0 | 19805.0 | 3485.0 | 7365.0 | 0.473 | 2391.0 | 0.560 | 1536.0 | ... | 34.1 | 26.0 | 117.1 | 37.0 | 45.0 | 3.838556 | NaN | NaN | 0 | 0 |
| 59 | 2023 | washington wizards | 82.0 | 19755.0 | 3456.0 | 7126.0 | 0.485 | 2531.0 | 0.559 | 1442.0 | ... | 34.2 | 25.4 | 113.2 | 35.0 | 47.0 | 2.969104 | NaN | NaN | 0 | 0 |
| 60 | 2022 | atlanta hawks | 82.0 | 19705.0 | 3401.0 | 7241.0 | 0.470 | 2345.0 | 0.531 | 1485.0 | ... | 33.9 | 24.6 | 113.9 | 43.0 | 39.0 | 5.432341 | NaN | NaN | 1 | 0 |
| 61 | 2022 | boston celtics | 82.0 | 19905.0 | 3341.0 | 7167.0 | 0.466 | 2256.0 | 0.547 | 1397.0 | ... | 35.5 | 24.8 | 111.8 | 51.0 | 31.0 | 13.206362 | NaN | NaN | 1 | 0 |
| 62 | 2022 | brooklyn nets | 82.0 | 19755.0 | 3442.0 | 7251.0 | 0.475 | 2502.0 | 0.538 | 1434.0 | ... | 34.1 | 25.3 | 112.9 | 44.0 | 38.0 | 6.103811 | NaN | NaN | 1 | 0 |
| 63 | 2022 | chicago bulls | 82.0 | 19730.0 | 3422.0 | 7127.0 | 0.480 | 2550.0 | 0.535 | 1436.0 | ... | 33.7 | 23.9 | 111.6 | 46.0 | 36.0 | 3.285980 | NaN | NaN | 1 | 0 |
| 64 | 2022 | charlotte hornets | 82.0 | 19880.0 | 3508.0 | 7497.0 | 0.468 | 2365.0 | 0.542 | 1298.0 | ... | 33.7 | 28.1 | 115.3 | 43.0 | 39.0 | 4.695518 | NaN | NaN | 0 | 0 |
| 65 | 2022 | cleveland cavaliers | 82.0 | 19730.0 | 3255.0 | 6940.0 | 0.469 | 2302.0 | 0.541 | 1376.0 | ... | 34.0 | 25.2 | 107.8 | 44.0 | 38.0 | 5.081598 | NaN | NaN | 0 | 0 |
| 66 | 2022 | dallas mavericks | 82.0 | 19755.0 | 3222.0 | 6982.0 | 0.461 | 2149.0 | 0.548 | 1341.0 | ... | 33.8 | 23.4 | 108.0 | 52.0 | 30.0 | 6.743320 | NaN | NaN | 1 | 0 |
| 67 | 2022 | denver nuggets | 82.0 | 19805.0 | 3416.0 | 7079.0 | 0.483 | 2377.0 | 0.575 | 1372.0 | ... | 34.9 | 27.8 | 112.7 | 48.0 | 34.0 | 8.835371 | NaN | NaN | 1 | 0 |
| 68 | 2022 | detroit pistons | 82.0 | 19780.0 | 3129.0 | 7267.0 | 0.431 | 2204.0 | 0.498 | 1413.0 | ... | 32.0 | 23.5 | 104.8 | 23.0 | 59.0 | -11.534968 | NaN | NaN | 0 | 0 |
| 69 | 2022 | golden state warriors | 82.0 | 19730.0 | 3323.0 | 7087.0 | 0.469 | 2147.0 | 0.557 | 1280.0 | ... | 35.7 | 27.1 | 111.0 | 53.0 | 29.0 | 11.684598 | 2022.0 | golden state warriors | 1 | 1 |
| 70 | 2022 | houston rockets | 82.0 | 19755.0 | 3229.0 | 7083.0 | 0.456 | 2124.0 | 0.543 | 1434.0 | ... | 32.4 | 23.6 | 109.7 | 20.0 | 62.0 | -9.883424 | NaN | NaN | 0 | 0 |
| 71 | 2022 | indiana pacers | 82.0 | 19880.0 | 3398.0 | 7338.0 | 0.463 | 2401.0 | 0.541 | 1347.0 | ... | 32.6 | 25.4 | 111.5 | 25.0 | 57.0 | -2.462554 | NaN | NaN | 0 | 0 |
| 72 | 2022 | los angeles clippers | 82.0 | 19780.0 | 3285.0 | 7170.0 | 0.458 | 2238.0 | 0.512 | 1273.0 | ... | 34.9 | 24.0 | 108.4 | 42.0 | 40.0 | 2.693561 | NaN | NaN | 0 | 0 |
| 73 | 2022 | los angeles lakers | 82.0 | 19980.0 | 3415.0 | 7279.0 | 0.469 | 2433.0 | 0.546 | 1380.0 | ... | 34.5 | 24.0 | 112.1 | 33.0 | 49.0 | 0.602110 | NaN | NaN | 0 | 0 |
| 74 | 2022 | memphis grizzlies | 82.0 | 19780.0 | 3571.0 | 7739.0 | 0.461 | 2626.0 | 0.519 | 1393.0 | ... | 35.0 | 26.0 | 115.6 | 56.0 | 26.0 | 11.182765 | NaN | NaN | 1 | 0 |
| 75 | 2022 | miami heat | 82.0 | 19855.0 | 3246.0 | 6954.0 | 0.467 | 2132.0 | 0.531 | 1416.0 | ... | 33.9 | 25.5 | 110.0 | 53.0 | 29.0 | 9.139626 | NaN | NaN | 1 | 0 |
| 76 | 2022 | milwaukee bucks | 82.0 | 19755.0 | 3429.0 | 7331.0 | 0.468 | 2276.0 | 0.544 | 1459.0 | ... | 36.5 | 23.9 | 115.5 | 51.0 | 31.0 | 9.273393 | NaN | NaN | 1 | 0 |
| 77 | 2022 | minnesota timberwolves | 82.0 | 19780.0 | 3411.0 | 7458.0 | 0.457 | 2200.0 | 0.540 | 1474.0 | ... | 32.9 | 25.7 | 115.9 | 46.0 | 36.0 | 6.222438 | NaN | NaN | 1 | 0 |
| 78 | 2022 | new orleans pelicans | 82.0 | 19755.0 | 3294.0 | 7212.0 | 0.457 | 2421.0 | 0.528 | 1501.0 | ... | 33.2 | 25.0 | 109.3 | 36.0 | 46.0 | 0.240126 | NaN | NaN | 1 | 0 |
| 79 | 2022 | new york knicks | 82.0 | 19780.0 | 3088.0 | 7069.0 | 0.437 | 2006.0 | 0.497 | 1473.0 | ... | 34.6 | 21.9 | 106.5 | 37.0 | 45.0 | -0.549704 | NaN | NaN | 0 | 0 |
50 rows × 47 columns
We used a LabelEncoder to convert the TEAM and CHAMPIONSHIP_TEAM columns to numerical values. We chose label encoding over one-hot encoding here because these columns contain far more than two categories, and one-hot encoding would have added dozens of sparse columns to the dataframe.
# Initialize LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Fit and transform the 'TEAM' and 'CHAMPIONSHIP_TEAM' columns
encoded_final_team_df['TEAM'] = le.fit_transform(encoded_final_team_df['TEAM'])
encoded_final_team_df['CHAMPIONSHIP_TEAM'] = le.fit_transform(encoded_final_team_df['CHAMPIONSHIP_TEAM'])
# Print the first 50 rows of the dataframe
encoded_final_team_df.head(50)
| | SEASON | TEAM | GAMES_x | MINUTES_PLAYED | FIELD_GOALS_MADE | FIELD_GOALS_ATTEMPTED | FIELD_GOALS_PERCENTAGE | TWO_POINTERS_MADE | TWO_POINTERS_PERCENTAGE | FREE_THROWS_MADE | ... | DEFENSIVE_REBOUND_PER_GAME | ASSIST_PER_GAME | POINTS_PER_GAME | W | L | Composite Score | YEAR | CHAMPIONSHIP_TEAM | PLAYOFFS | CHAMPION |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30 | 2023 | 0 | 82.0 | 19855.0 | 3658.0 | 7574.0 | 0.483 | 2776.0 | 0.548 | 1513.0 | ... | 33.2 | 25.0 | 118.4 | 41.0 | 41.0 | 5.039652 | NaN | 24 | 0 | 0 |
| 31 | 2023 | 1 | 82.0 | 19980.0 | 3460.0 | 7278.0 | 0.475 | 2145.0 | 0.567 | 1436.0 | ... | 35.6 | 26.7 | 117.9 | 57.0 | 25.0 | 14.503575 | NaN | 24 | 0 | 0 |
| 32 | 2023 | 2 | 82.0 | 19730.0 | 3399.0 | 6978.0 | 0.487 | 2351.0 | 0.559 | 1449.0 | ... | 32.3 | 25.5 | 113.4 | 45.0 | 37.0 | 6.229818 | NaN | 24 | 0 | 0 |
| 33 | 2023 | 8 | 82.0 | 19905.0 | 3488.0 | 7116.0 | 0.490 | 2634.0 | 0.555 | 1446.0 | ... | 33.9 | 24.5 | 113.1 | 40.0 | 42.0 | 6.265528 | NaN | 24 | 0 | 0 |
| 34 | 2023 | 7 | 82.0 | 19830.0 | 3385.0 | 7413.0 | 0.457 | 2504.0 | 0.528 | 1447.0 | ... | 33.5 | 25.1 | 111.0 | 27.0 | 55.0 | -5.664596 | NaN | 24 | 0 | 0 |
| 35 | 2023 | 9 | 82.0 | 19880.0 | 3408.0 | 6984.0 | 0.488 | 2458.0 | 0.559 | 1439.0 | ... | 31.4 | 24.9 | 112.3 | 51.0 | 31.0 | 10.240028 | NaN | 24 | 0 | 0 |
| 36 | 2023 | 11 | 82.0 | 19930.0 | 3283.0 | 6909.0 | 0.475 | 2037.0 | 0.574 | 1554.0 | ... | 31.2 | 22.9 | 114.2 | 38.0 | 44.0 | 3.161333 | NaN | 24 | 0 | 0 |
| 37 | 2023 | 12 | 82.0 | 19755.0 | 3574.0 | 7088.0 | 0.504 | 2605.0 | 0.575 | 1378.0 | ... | 32.9 | 28.9 | 115.8 | 53.0 | 29.0 | 11.046367 | 2023.0 | 4 | 0 | 1 |
| 38 | 2023 | 14 | 82.0 | 19805.0 | 3244.0 | 7140.0 | 0.454 | 2310.0 | 0.516 | 1623.0 | ... | 31.3 | 23.0 | 110.3 | 17.0 | 65.0 | -10.766249 | NaN | 24 | 0 | 0 |
| 39 | 2023 | 15 | 82.0 | 19830.0 | 3538.0 | 7393.0 | 0.479 | 2175.0 | 0.564 | 1314.0 | ... | 34.1 | 29.8 | 118.9 | 44.0 | 38.0 | 8.250557 | NaN | 24 | 0 | 0 |
| 40 | 2023 | 17 | 82.0 | 19755.0 | 3329.0 | 7286.0 | 0.457 | 2473.0 | 0.530 | 1567.0 | ... | 32.9 | 22.4 | 110.7 | 22.0 | 60.0 | -9.554702 | NaN | 24 | 0 | 0 |
| 41 | 2023 | 18 | 82.0 | 19755.0 | 3444.0 | 7345.0 | 0.469 | 2332.0 | 0.540 | 1535.0 | ... | 31.4 | 27.0 | 116.3 | 35.0 | 47.0 | -0.748703 | NaN | 24 | 0 | 0 |
| 42 | 2023 | 22 | 82.0 | 19830.0 | 3370.0 | 7059.0 | 0.477 | 2329.0 | 0.539 | 1533.0 | ... | 33.4 | 23.9 | 113.6 | 44.0 | 38.0 | 5.304915 | NaN | 24 | 0 | 0 |
| 43 | 2023 | 23 | 82.0 | 19880.0 | 3516.0 | 7298.0 | 0.482 | 2631.0 | 0.555 | 1691.0 | ... | 35.7 | 25.3 | 117.2 | 43.0 | 39.0 | 7.100823 | NaN | 24 | 0 | 0 |
| 44 | 2023 | 25 | 82.0 | 19780.0 | 3585.0 | 7551.0 | 0.475 | 2600.0 | 0.548 | 1432.0 | ... | 34.6 | 26.0 | 116.9 | 51.0 | 31.0 | 9.904935 | NaN | 24 | 0 | 0 |
| 45 | 2023 | 30 | 82.0 | 19805.0 | 3215.0 | 6991.0 | 0.460 | 2235.0 | 0.540 | 1567.0 | ... | 30.9 | 23.8 | 109.5 | 44.0 | 38.0 | 1.289634 | NaN | 24 | 0 | 0 |
| 46 | 2023 | 31 | 82.0 | 19830.0 | 3504.0 | 7411.0 | 0.473 | 2287.0 | 0.557 | 1364.0 | ... | 37.5 | 25.8 | 116.9 | 58.0 | 24.0 | 11.618075 | NaN | 24 | 0 | 0 |
| 47 | 2023 | 33 | 82.0 | 19830.0 | 3515.0 | 7167.0 | 0.490 | 2518.0 | 0.568 | 1467.0 | ... | 32.8 | 26.2 | 115.8 | 42.0 | 40.0 | 5.278023 | NaN | 24 | 0 | 0 |
| 48 | 2023 | 38 | 82.0 | 19855.0 | 3447.0 | 7180.0 | 0.480 | 2548.0 | 0.541 | 1585.0 | ... | 33.1 | 25.9 | 114.4 | 42.0 | 40.0 | 6.503432 | NaN | 24 | 0 | 0 |
| 49 | 2023 | 40 | 82.0 | 19955.0 | 3444.0 | 7328.0 | 0.470 | 2407.0 | 0.547 | 1589.0 | ... | 34.0 | 22.9 | 116.0 | 47.0 | 35.0 | 6.553436 | NaN | 24 | 0 | 0 |
| 50 | 2023 | 43 | 82.0 | 19855.0 | 3533.0 | 7590.0 | 0.465 | 2538.0 | 0.530 | 1572.0 | ... | 32.3 | 24.4 | 117.5 | 40.0 | 42.0 | 3.638798 | NaN | 24 | 0 | 0 |
| 51 | 2023 | 44 | 82.0 | 19780.0 | 3323.0 | 7074.0 | 0.470 | 2440.0 | 0.539 | 1607.0 | ... | 33.1 | 23.2 | 111.4 | 34.0 | 48.0 | -1.575875 | NaN | 24 | 0 | 0 |
| 52 | 2023 | 45 | 82.0 | 19880.0 | 3347.0 | 6870.0 | 0.487 | 2312.0 | 0.551 | 1719.0 | ... | 32.2 | 25.2 | 115.2 | 54.0 | 28.0 | 10.753173 | NaN | 24 | 0 | 0 |
| 53 | 2023 | 46 | 82.0 | 19780.0 | 3453.0 | 7388.0 | 0.467 | 2452.0 | 0.520 | 1412.0 | ... | 32.4 | 27.3 | 113.6 | 45.0 | 37.0 | 6.281789 | NaN | 24 | 0 | 0 |
| 54 | 2023 | 49 | 82.0 | 19730.0 | 3317.0 | 7001.0 | 0.474 | 2261.0 | 0.550 | 1609.0 | ... | 31.1 | 24.2 | 113.4 | 33.0 | 49.0 | -2.474210 | NaN | 24 | 0 | 0 |
| 55 | 2023 | 50 | 82.0 | 19830.0 | 3573.0 | 7232.0 | 0.494 | 2445.0 | 0.586 | 1624.0 | ... | 32.9 | 27.3 | 120.7 | 48.0 | 34.0 | 9.562584 | NaN | 24 | 0 | 0 |
| 56 | 2023 | 51 | 82.0 | 19855.0 | 3533.0 | 7593.0 | 0.465 | 2622.0 | 0.529 | 1292.0 | ... | 31.9 | 27.2 | 113.0 | 22.0 | 60.0 | -10.049878 | NaN | 24 | 0 | 0 |
| 57 | 2023 | 59 | 82.0 | 19805.0 | 3434.0 | 7489.0 | 0.459 | 2554.0 | 0.525 | 1506.0 | ... | 30.3 | 23.9 | 112.9 | 41.0 | 41.0 | 3.093160 | NaN | 24 | 0 | 0 |
| 58 | 2023 | 60 | 82.0 | 19805.0 | 3485.0 | 7365.0 | 0.473 | 2391.0 | 0.560 | 1536.0 | ... | 34.1 | 26.0 | 117.1 | 37.0 | 45.0 | 3.838556 | NaN | 24 | 0 | 0 |
| 59 | 2023 | 66 | 82.0 | 19755.0 | 3456.0 | 7126.0 | 0.485 | 2531.0 | 0.559 | 1442.0 | ... | 34.2 | 25.4 | 113.2 | 35.0 | 47.0 | 2.969104 | NaN | 24 | 0 | 0 |
| 60 | 2022 | 0 | 82.0 | 19705.0 | 3401.0 | 7241.0 | 0.470 | 2345.0 | 0.531 | 1485.0 | ... | 33.9 | 24.6 | 113.9 | 43.0 | 39.0 | 5.432341 | NaN | 24 | 1 | 0 |
| 61 | 2022 | 1 | 82.0 | 19905.0 | 3341.0 | 7167.0 | 0.466 | 2256.0 | 0.547 | 1397.0 | ... | 35.5 | 24.8 | 111.8 | 51.0 | 31.0 | 13.206362 | NaN | 24 | 1 | 0 |
| 62 | 2022 | 2 | 82.0 | 19755.0 | 3442.0 | 7251.0 | 0.475 | 2502.0 | 0.538 | 1434.0 | ... | 34.1 | 25.3 | 112.9 | 44.0 | 38.0 | 6.103811 | NaN | 24 | 1 | 0 |
| 63 | 2022 | 8 | 82.0 | 19730.0 | 3422.0 | 7127.0 | 0.480 | 2550.0 | 0.535 | 1436.0 | ... | 33.7 | 23.9 | 111.6 | 46.0 | 36.0 | 3.285980 | NaN | 24 | 1 | 0 |
| 64 | 2022 | 7 | 82.0 | 19880.0 | 3508.0 | 7497.0 | 0.468 | 2365.0 | 0.542 | 1298.0 | ... | 33.7 | 28.1 | 115.3 | 43.0 | 39.0 | 4.695518 | NaN | 24 | 0 | 0 |
| 65 | 2022 | 9 | 82.0 | 19730.0 | 3255.0 | 6940.0 | 0.469 | 2302.0 | 0.541 | 1376.0 | ... | 34.0 | 25.2 | 107.8 | 44.0 | 38.0 | 5.081598 | NaN | 24 | 0 | 0 |
| 66 | 2022 | 11 | 82.0 | 19755.0 | 3222.0 | 6982.0 | 0.461 | 2149.0 | 0.548 | 1341.0 | ... | 33.8 | 23.4 | 108.0 | 52.0 | 30.0 | 6.743320 | NaN | 24 | 1 | 0 |
| 67 | 2022 | 12 | 82.0 | 19805.0 | 3416.0 | 7079.0 | 0.483 | 2377.0 | 0.575 | 1372.0 | ... | 34.9 | 27.8 | 112.7 | 48.0 | 34.0 | 8.835371 | NaN | 24 | 1 | 0 |
| 68 | 2022 | 14 | 82.0 | 19780.0 | 3129.0 | 7267.0 | 0.431 | 2204.0 | 0.498 | 1413.0 | ... | 32.0 | 23.5 | 104.8 | 23.0 | 59.0 | -11.534968 | NaN | 24 | 0 | 0 |
| 69 | 2022 | 15 | 82.0 | 19730.0 | 3323.0 | 7087.0 | 0.469 | 2147.0 | 0.557 | 1280.0 | ... | 35.7 | 27.1 | 111.0 | 53.0 | 29.0 | 11.684598 | 2022.0 | 6 | 1 | 1 |
| 70 | 2022 | 17 | 82.0 | 19755.0 | 3229.0 | 7083.0 | 0.456 | 2124.0 | 0.543 | 1434.0 | ... | 32.4 | 23.6 | 109.7 | 20.0 | 62.0 | -9.883424 | NaN | 24 | 0 | 0 |
| 71 | 2022 | 18 | 82.0 | 19880.0 | 3398.0 | 7338.0 | 0.463 | 2401.0 | 0.541 | 1347.0 | ... | 32.6 | 25.4 | 111.5 | 25.0 | 57.0 | -2.462554 | NaN | 24 | 0 | 0 |
| 72 | 2022 | 22 | 82.0 | 19780.0 | 3285.0 | 7170.0 | 0.458 | 2238.0 | 0.512 | 1273.0 | ... | 34.9 | 24.0 | 108.4 | 42.0 | 40.0 | 2.693561 | NaN | 24 | 0 | 0 |
| 73 | 2022 | 23 | 82.0 | 19980.0 | 3415.0 | 7279.0 | 0.469 | 2433.0 | 0.546 | 1380.0 | ... | 34.5 | 24.0 | 112.1 | 33.0 | 49.0 | 0.602110 | NaN | 24 | 0 | 0 |
| 74 | 2022 | 25 | 82.0 | 19780.0 | 3571.0 | 7739.0 | 0.461 | 2626.0 | 0.519 | 1393.0 | ... | 35.0 | 26.0 | 115.6 | 56.0 | 26.0 | 11.182765 | NaN | 24 | 1 | 0 |
| 75 | 2022 | 30 | 82.0 | 19855.0 | 3246.0 | 6954.0 | 0.467 | 2132.0 | 0.531 | 1416.0 | ... | 33.9 | 25.5 | 110.0 | 53.0 | 29.0 | 9.139626 | NaN | 24 | 1 | 0 |
| 76 | 2022 | 31 | 82.0 | 19755.0 | 3429.0 | 7331.0 | 0.468 | 2276.0 | 0.544 | 1459.0 | ... | 36.5 | 23.9 | 115.5 | 51.0 | 31.0 | 9.273393 | NaN | 24 | 1 | 0 |
| 77 | 2022 | 33 | 82.0 | 19780.0 | 3411.0 | 7458.0 | 0.457 | 2200.0 | 0.540 | 1474.0 | ... | 32.9 | 25.7 | 115.9 | 46.0 | 36.0 | 6.222438 | NaN | 24 | 1 | 0 |
| 78 | 2022 | 38 | 82.0 | 19755.0 | 3294.0 | 7212.0 | 0.457 | 2421.0 | 0.528 | 1501.0 | ... | 33.2 | 25.0 | 109.3 | 36.0 | 46.0 | 0.240126 | NaN | 24 | 1 | 0 |
| 79 | 2022 | 40 | 82.0 | 19780.0 | 3088.0 | 7069.0 | 0.437 | 2006.0 | 0.497 | 1473.0 | ... | 34.6 | 21.9 | 106.5 | 37.0 | 45.0 | -0.549704 | NaN | 24 | 0 | 0 |
50 rows × 47 columns
Here we fill null or absent values with the mean of that column computed within the same team and season. Grouping the imputation this way keeps each team's statistics from being contaminated by values from other teams.
cols_to_fill = encoded_final_team_df.columns.tolist()
for col in cols_to_fill:
    encoded_final_team_df[col] = encoded_final_team_df.groupby(['TEAM', 'SEASON'])[col].transform(lambda x: x.fillna(x.mean()))
encoded_final_team_df
encoded_final_team_df['CHAMPION'].value_counts()[1]
We rely on feature bagging to determine how many features each tree in the Random Forest considers: at every split a random subset of features (typically the square root of the total number of features) is evaluated, which increases diversity among the trees in the ensemble and helps prevent overfitting. In addition, we drop columns whose information gain scores are low relative to the others, prioritizing the features that contribute most to the model's predictive power while managing computational cost and further reducing the risk of overfitting.
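The square-root rule can be made explicit when constructing the forest. A minimal sketch on toy data (`X_demo`/`y_demo` are stand-ins, not our dataset); note that in recent scikit-learn versions `max_features="sqrt"` is already the default for `RandomForestClassifier`:

```python
import math
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=16, random_state=0)

# max_features="sqrt" tells each split to consider only ~sqrt(16) = 4 randomly
# chosen candidate features, which decorrelates the trees in the ensemble
rf = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)
rf.fit(X_demo, y_demo)
print(math.isqrt(16))  # candidate features per split: 4
```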
columns_to_drop = [
# 'SEASON',
# 'TEAM',
'GAMES_x',
'MINUTES_PLAYED',
# 'FIELD_GOALS_MADE',
# 'FIELD_GOALS_ATTEMPTED',
'FIELD_GOALS_PERCENTAGE',
# 'TWO_POINTERS_MADE',
'TWO_POINTERS_PERCENTAGE',
'FREE_THROWS_MADE',
# 'FREE_THROWS_ATTEMPTED',
'FREE_THROW_PERCENTAGE_x',
'DEFENSIVE_REBOUND',
# 'TOTAL_REBOUNDS',
'ASSISTS',
'STEALS',
'BLOCKS',
'TURNOVERS',
'POINTS',
'L_x',
'AGE',
'PW',
'PL',
# 'MARGIN_OF_VICTORY',
'STRENGTH_OF_SCHEDULE',
# 'SIMPLE_RATING_SYSTEM',
'OFFENSIVE_RATING',
'DEFENSIVE_RATING',
'NET_RATING',
'TRUE_SHOOTING_PERCENTAGE',
'EFFECTIEV_FIELD_GOAL_PERCENTAGE',
'TURNOVER_PERCENTAGE',
'OPPONENT_EFFECTIVE_FIELD_GAL',
'OPPONENT_FREE_THROW_PERCENTAGE.1',
'fg_per_game',
'FIELD_GOAL_PERCENTAGE',
'TWO_POINTER_PERCENTAGE',
'DEFENSIVE_REBOUND_PER_GAME',
'ASSIST_PER_GAME',
'POINTS_PER_GAME',
'W',
'L',
# 'Composite Score',
'YEAR',
'CHAMPIONSHIP_TEAM',
'PLAYOFFS',
# 'CHAMPION'
]
encoded_final_team_df.drop(columns_to_drop, axis=1, inplace=True)
Here we obtain the X (features) and y (target) variables for our random forest classifier. Since CHAMPION is the dependent variable, we drop it from the dataframe before assigning to X; y, on the other hand, consists of the CHAMPION column alone.
X = encoded_final_team_df.drop('CHAMPION', axis=1)
y = encoded_final_team_df['CHAMPION']
The dataset is then split into training and testing sets: 20% is held out for testing and 80% is used for training. We then grow a forest of 100 trees.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)
modelF = RandomForestClassifier(n_estimators=100, random_state=55)
modelF.fit(X_train, y_train)
The accuracy and confusion matrix are obtained.
y_pred = modelF.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
A 5-fold cross-validation is performed on our model to provide a more reliable assessment of its performance; the individual fold scores and their average are then reported.
from sklearn.model_selection import cross_val_score
# Perform 5-fold cross validation
scores = cross_val_score(modelF, X, y, cv=5)
# Print cross-validation scores
print('Cross-validation scores: ', scores)
# Print average cross-validation score
print('Average cross-validation score: ', scores.mean())
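Because champions are so rare in our data, it can be worth making the fold construction explicitly stratified so that every fold contains roughly the same proportion of positive examples (`cross_val_score` already stratifies for classifiers when given an integer `cv`, but `StratifiedKFold` makes the choice visible and adds shuffling). A sketch on toy imbalanced data, where `X_demo`/`y_demo` stand in for our feature matrix and CHAMPION labels:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced toy data: ~5% positives, mimicking the rarity of champion rows
X_demo, y_demo = make_classification(n_samples=400, weights=[0.95], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=0),
                         X_demo, y_demo, cv=skf)
print(scores, scores.mean())
```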
A classification report is obtained.
Let's break it down below:
Precision: Precision measures how many of the positive predictions made by the classifier are actually correct. It is calculated as the ratio of true positive predictions to the total number of positive predictions. In our case, the precision for class 1 (the positive class) is 0.40, which means that only 40% of the positive predictions were accurate.
Recall (Sensitivity): Recall, also known as sensitivity or true positive rate, measures how many of the actual positive instances were correctly predicted by the classifier. It is calculated as the ratio of true positive predictions to the total number of actual positive instances. In our case, the recall for class 1 is 0.18, indicating that only 18% of the actual positive instances were correctly identified.
F1-Score: The F1-score is the harmonic mean of precision and recall. It provides a single metric that balances both precision and recall. It is especially useful for imbalanced datasets where precision and recall may be skewed. The F1-score is calculated as follows:
$$ F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$
For class 1, the F1-score is 0.25, which reflects the trade-off between precision and recall.
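The class-1 numbers from our report can be plugged into the formula directly as a quick sanity check:

```python
# Worked check of the F1 formula using the class-1 precision and recall above
precision, recall = 0.40, 0.18
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.25
```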
Support: The support is the number of actual instances of each class. In our case, there are 11 instances of class 1.
In summary, our classifier performs well in terms of precision for class 0 (the negative class), but its recall and F1-score for class 1 (the positive class) are relatively low.
The low support (number of actual instances) for class 1 certainly impacts the recall.
When dealing with a small number of instances in the positive class (class 1), the classifier has fewer opportunities to correctly predict them. If the model misclassifies even a few instances, it significantly affects the recall because the denominator (total actual positive instances) is small. In our case, with only 11 instances of class 1, the model’s ability to capture them accurately is limited.
Imbalanced datasets (where one class has significantly fewer instances than the other) can lead to biased performance metrics. In such cases, the model may prioritize the majority class (class 0) due to its higher prevalence, resulting in lower recall for the minority class (class 1).
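One lightweight mitigation scikit-learn offers for this imbalance is `class_weight="balanced"`, which reweights classes inversely to their frequency so that errors on the rare positive class cost more during training. A hedged sketch on toy imbalanced data (we did not apply this in our final model; `X_demo`/`y_demo` are stand-ins):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Imbalanced toy data standing in for our champion/non-champion split
X_demo, y_demo = make_classification(n_samples=400, weights=[0.95], random_state=0)

# class_weight="balanced" scales sample weights by n_samples / (n_classes * count),
# boosting the influence of the minority class at each split
rf_balanced = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                     random_state=55)
rf_balanced.fit(X_demo, y_demo)
print(rf_balanced.score(X_demo, y_demo))
```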
from sklearn.metrics import classification_report
# Get predictions
y_pred = modelF.predict(X_test)
# Print classification report
print(classification_report(y_test, y_pred))
We create a bar plot of the feature importances as determined by our Random Forest model. Feature importance assigns each feature a score; the higher the score, the more relevant that feature is to the output variable. This visualization helps us understand which features drive the model's predictions, which in turn sheds light on the underlying processes the model relies on.
importances = modelF.feature_importances_
# Get the index of importances from greatest importance to least
sorted_index = np.argsort(importances)[::-1]
x = range(len(sorted_index))
# Create tick labels
labels = np.array(X.columns)[sorted_index]
plt.bar(x, importances[sorted_index], tick_label=labels)
# Rotate tick labels to vertical
plt.xticks(rotation=90)
plt.show()
From the outset, our goal was clear and ambitious: predict the NBA champion for the upcoming year. Armed with data science tools and fueled by our passion for basketball, we embarked on this exhilarating journey.
We navigated the vast landscape of our dataset, delving into player statistics and team performance. Our journey led us through intricate details, where we wrangled and transformed raw data into a format our machine learning model could comprehend. Irrelevant noise was discarded, and significant patterns emerged from the numbers.
Now, standing at the precipice of discovery, we've completed rigorous data preprocessing, meticulous feature selection, and model training. The anticipation builds as we feed our processed data into our finely tuned model. This moment represents the culmination of our efforts—a transformation from raw data to a prediction for the future.
Without further ado, let the power of machine learning guide us.
present_Advanced = pd.read_csv("/content/drive/MyDrive/FINAL PROJECT CSVs/2024 Advanced.csv")
present_Total = pd.read_csv("/content/drive/MyDrive/FINAL PROJECT CSVs/2024 Total.csv")
present_Advanced.columns.tolist()
present_Advanced = present_Advanced.drop(['Rk',
'Unnamed: 18',
'Unnamed: 23',
'eFG%',
'TURNOVER_PERCENTAGE.1',
'DEFENSIVE_REBOUND_PERCENTAGE',
'FT/FGA',
'Unnamed: 28',
'Arena',
'Attend.',
'Attend./G'], axis = 1)
present_Total.columns.tolist()
present_Total = present_Total.drop('Rk', axis=1)
Our validation dataframe is created by merging the two 2024 dataframes, the totals dataframe and the advanced dataframe, on the TEAM column.
Testing_df = present_Total.merge(present_Advanced, on='TEAM')
Testing_df.columns.tolist()
Next we calculate the composite score for the 2024 data, using the same correlation-weighted sum of standardized statistics as before.
numeric_cols = Testing_df.select_dtypes(include=np.number)
# Calculate correlation with a win
correlation_with_w = numeric_cols.apply(lambda x: x.corr(Testing_df['W']))
# Standardize numeric variables
normalized_variables = (numeric_cols - numeric_cols.mean()) / numeric_cols.std()
# Calculate the composite score, taking into account the direction of correlation
Testing_df['Composite Score'] = normalized_variables.multiply(correlation_with_w, axis=1).sum(axis=1)
We drop the columns that don't match those used to train our ML model above. This ensures the features align with those of our random forest classifier so the dataframe can be passed in for prediction.
Testing_df = Testing_df.drop([#'TEAM',
'PLAYOFFS_x',
'GAMES',
'MINUTES_PLAYED',
#'FIELD_GOALS_MADE',
# 'FIELD_GOALS_ATTEMPTED',
'FIELD_GOALS_PERCENTAGE',
'THREE_POINTERS_MADE',
'THREE_POINTERS_ATTEMPTED',
'THREE_POINTERS_PERCENTAGE',
# 'TWO_POINTERS_MADE',
'TWO_POINTERS_ATTEMPTED',
'TWO_POINTERS_PERCENTAGE',
'FREE_THROWS_MADE',
# 'FREE_THROWS_ATTEMPTED',
'FREE_THROWS_PERCENTAGE',
'OFFENSIVE_REBOUNDS',
'DEFENSIVE_REBOUNDS',
#'TOTAL_REBOUNDS',
'ASSISTS',
'STEALS',
'BLOCKS',
'TURNOVERS',
'PERSONAL_FOULS',
'POINTS',
'PLAYOFFS_y',
'AGE',
'W',
'L',
'PW',
'PL',
#'MARGIN_OF_VICTORY',
'STRENGTH_OF_SCHEDULE',
#'SIMPLE_RATING_SYSTEM',
'OFFENSIVE_RATING',
'DEFENSIVE_RATING',
'NET_RATING',
#'Composite Score',
'PACE',
'FREE_THROW_ATTEMPT_RATE',
'THREE_POINT_ATTEMPT_RATE',
'TRUE_SHOOTING_PERCENTAGE',
'EFFECTIVE_FIELD_GOAL_PERCENTAGE',
'TURNOVER_PERCENTAGE',
'OFFENSIVE_REBOUND_PERCENTAGE',
'FREE_THROWS_PER_FIELD_GOAL_ATTEMPT'], axis = 1)
A SEASON column is added and filled with the value 2024, representing the year.
Testing_df ['SEASON'] = 2024
Testing_df
encoded_final_team_df.columns.tolist()
Testing_df.columns.tolist()
Here we make SEASON the first column so that the column order matches the order in which the features occurred during training.
cols = ['SEASON'] + [col for col in Testing_df.columns if col != 'SEASON']
Testing_df = Testing_df[cols]
Label encoding is applied to the TEAM column, converting its values into a numerical format our ML model can process.
# Refit the label encoder on the 2024 team names and encode them
Testing_df['TEAM'] = le.fit_transform(Testing_df['TEAM'])
Testing_df.head(50)
Our trained model is now used for the long-awaited prediction; we pass our dataframe Testing_df into it.
encoded_final_team_df.columns.tolist()
Testing_df.columns.tolist()
predictions = modelF.predict(Testing_df)
We then observe that exactly one array index has the value 1. This is a positive result: the 1 indicates that the team at that index is predicted to be the champion, while the teams with 0 values will unfortunately miss out on the long-desired prize.
predictions
# Add the predictions as a new column to the DataFrame
Testing_df['predictions'] = predictions
Testing_df
When we filter Testing_df based on the prediction, we can see that the winner is team 1. Team 1, before label encoding, happens to be the Boston Celtics.
# Filter the DataFrame to the row where 'predictions' is 1
filtered_df = Testing_df[Testing_df['predictions'] == 1]
# Get the encoded 'TEAM' value
team_value = filtered_df['TEAM'].values[0]
# Recover the original team name via the encoder's inverse transform
original_team_name = le.inverse_transform([team_value])
# Print the original team name
print("And our predicted winner is:", original_team_name)
# Export the first three decision trees from the forest
import graphviz
from sklearn.tree import export_graphviz
for i in range(3):
    tree = modelF.estimators_[i]
    dot_data = export_graphviz(tree,
                               feature_names=X_train.columns,
                               filled=True,
                               max_depth=2,
                               impurity=False,
                               proportion=True)
    graph = graphviz.Source(dot_data)
    graph.render(f'tree_{i}', format='png')
    display(graph)
In our recent data science project, we leveraged historical data on player performance, opponent statistics, and team attributes to predict the 2024 NBA champion. Our extensive data cleaning and processing led us to identify the most relevant features for our predictive model. Interestingly, we found that features extracted from team-related data sources provided the most predictive power.
We utilized Random Forests as our primary machine learning model due to its ability to handle large datasets with numerous features while effectively managing overfitting. This approach enabled us to capture complex relationships within the data and make robust predictions about future championship outcomes.
Our model’s prediction pointed towards the Boston Celtics as the potential victors of the NBA 2024 championship. This insight underscores the effectiveness of our approach in analyzing historical trends and team performance metrics to identify strong contenders for the title.
In conclusion, our project yielded valuable insights into the factors influencing team success in professional basketball. Our focus on relevant features and the use of Random Forests as our predictive model allowed us to provide actionable insights for stakeholders and enthusiasts alike. While our model’s forecast favors the Boston Celtics, we acknowledge the dynamic nature of sports and the potential for unexpected outcomes. Nonetheless, our methodology demonstrates the potential of data-driven approaches in enhancing decision-making processes and gaining deeper insights into the complex dynamics of competitive sports.
This project was a great learning experience, helping us develop skills in data cleaning and processing. The prediction of the Boston Celtics as potential winners reaffirms our belief in the effectiveness of our work as they are currently up 1-0 in the NBA playoffs semi-final. We unearthed valuable insights and presented visualizations that could spur further inquiries and investigations, potentially leading to positive outcomes for basketball, such as improved popularity, performance, scouting, and player development.
While we wouldn’t advise using our prediction as a sure bet due to the preliminary nature of our model and the observed imprecision in the values, we hope that the outcome of this year’s NBA season reflects the results of our report. We believe the low scores for the positive outcome in our model are due to our data being unbalanced. For future research, we recommend considering techniques adapted for imbalanced data, such as the Synthetic Minority Over-sampling Technique (SMOTE), or ensemble methods designed for class imbalance.
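The core idea behind SMOTE can be sketched in a few lines: synthesize new minority-class rows by interpolating between a minority sample and one of its nearest minority neighbours. The helper below is a simplified illustration on random data, not the production API (the imbalanced-learn library provides a full implementation as `imblearn.over_sampling.SMOTE`):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sketch(X_minority, n_new, k=3):
    """Simplified SMOTE: create new minority samples by interpolating between
    a random minority point and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        # distances from point i to every minority point (itself included)
        dists = np.linalg.norm(X_minority - X_minority[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip self at index 0
        j = rng.choice(neighbours)
        lam = rng.random()                        # interpolation factor in [0, 1)
        synthetic.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

minority = rng.normal(size=(11, 4))   # e.g. our 11 champion rows, 4 features
new_rows = smote_sketch(minority, n_new=20)
print(new_rows.shape)  # (20, 4)
```

Appending such synthetic rows to the training set would give the classifier more positive examples to learn from, at the cost of some artificiality in the data.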
Looking ahead, we are hopeful that our work will contribute to the broader understanding of basketball dynamics and inspire further research in this exciting field.
!pip install nbconvert
%cd '/content/drive/My Drive/Colab Notebooks'
!jupyter nbconvert --to html Final-Project.ipynb